Abstract

As examples such as the Monty Hall puzzle show, applying conditioning to update a
probability distribution on a “naive space”, which does not take into account the protocol
used, can often lead to counterintuitive results. Here we examine why. A criterion
known  as  CAR  (“coarsening  at  random”)  in  the  statistical  literature  characterizes  when 
“naive”  conditioning  in  a  naive  space  works.  We  show  that  the  CAR  condition  holds  rather 
infrequently,  and  we  provide  a  procedural  characterization  of  it,  by  giving  a  randomized 
algorithm  that  generates  all  and  only  distributions  for  which  CAR  holds.  This  substantially 
extends previous characterizations of CAR. We also consider more general notions of
update such as Jeffrey conditioning and minimizing relative entropy (MRE). We give a
generalization of the CAR condition that characterizes when Jeffrey conditioning leads to
appropriate answers, and show that there exist some very simple settings in which MRE
essentially never gives the right results. This generalizes and interconnects previous results
obtained in the literature on CAR and MRE.

1.  Introduction 

Suppose an agent represents her uncertainty about a domain using a probability distribution.
At some point, she receives some new information about the domain. How should she
update  her  distribution  in  the  light  of  this  information?  Conditioning  is  by  far  the  most 
common  method  in  case  the  information  comes  in  the  form  of  an  event.  However,  there  are 
numerous  well-known  examples  showing  that  naive  conditioning  can  lead  to  problems.  We 
give  just  two  of  them  here. 


Example 1.1: The Monty Hall puzzle (Mosteller, 1965; vos Savant, 1990): Suppose that
you’re on a game show and given a choice of three doors. Behind one is a car; behind the
others are goats. You pick door 1. Before opening door 1, Monty Hall, the host (who knows
what  is  behind  each  door)  opens  door  3,  which  has  a  goat.  He  then  asks  you  if  you  still 
want  to  take  what’s  behind  door  1,  or  to  take  what’s  behind  door  2  instead.  Should  you 
switch?  Assuming  that,  initially,  the  car  was  equally  likely  to  be  behind  each  of  the  doors, 
naive  conditioning  suggests  that,  given  that  it  is  not  behind  door  3,  it  is  equally  likely  to  be 
behind  door  1  and  door  2.  Thus,  there  is  no  reason  to  switch.  However,  another  argument 
suggests  you  should  switch:  if  a  goat  is  behind  door  1  (which  happens  with  probability  2/3), 
switching helps; if a car is behind door 1 (which happens with probability 1/3), switching
hurts.  Which  argument  is  right?  | 
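
The two arguments can be compared by computing the posterior exactly once Monty's protocol is spelled out. The following Python sketch is illustrative only. It assumes (the puzzle statement does not say so) that Monty never opens the contestant's door, never reveals the car, and opens door 3 with probability `p_open3_if_car1` when the car is behind door 1. The function and parameter names are hypothetical.

```python
from fractions import Fraction


def monty_hall_posterior(p_open3_if_car1=Fraction(1, 2)):
    """Posterior over the car's location after you pick door 1 and Monty opens door 3.

    p_open3_if_car1 is an assumed protocol parameter: the probability that
    Monty opens door 3 when the car is behind door 1 (his only free choice).
    """
    prior = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
    # Probability that Monty opens door 3, given where the car is.
    # Car behind 2: he must open 3. Car behind 3: he cannot open 3.
    likelihood = {1: Fraction(p_open3_if_car1), 2: Fraction(1), 3: Fraction(0)}
    joint = {door: prior[door] * likelihood[door] for door in prior}
    total = sum(joint.values())
    return {door: joint[door] / total for door in prior}


def naive_posterior():
    """Naive conditioning: rule out door 3 and renormalize the prior."""
    return {1: Fraction(1, 2), 2: Fraction(1, 2), 3: Fraction(0)}
```

Under the uniform-choice assumption (the default), the posterior for door 2 is 2/3, in line with the second argument. The naive answer of 1/2 is recovered only under the assumed protocol in which Monty always opens door 3 when the car is behind door 1.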

Example  1.2:  The  three-prisoners  puzzle  (Bar-Hillel  &  Falk,  1982;  Gardner,  1961;  Mosteller, 
1965): Of three prisoners a, b, and c, two are to be executed, but a does not know which.
Thus, a thinks that the probability that i will be executed is 2/3 for $i \in \{a, b, c\}$. He says
to  the  jailer,  “Since  either  b  or  c  is  certainly  going  to  be  executed,  you  will  give  me  no 
information  about  my  own  chances  if  you  give  me  the  name  of  one  man,  either  b  or  c,  who 
is  going  to  be  executed.”  But  then,  no  matter  what  the  jailer  says,  naive  conditioning  leads 
a  to  believe  that  his  chance  of  execution  went  down  from  2/3  to  1/2.  | 

There  are  numerous  other  well-known  examples  where  naive  conditioning  gives  what 
seems  to  be  an  inappropriate  answer,  including  the  two-children  puzzle  (Gardner,  1982; 
vos  Savant,  1996,  1994)  and  the  second-ace  puzzle  (Freund,  1965;  Shafer,  1985;  Halpern  & 
Tuttle,  1993). 1 

Why  does  naive  conditioning  give  the  wrong  answer  in  such  examples?  As  argued  by 
Halpern and Tuttle (1993) and Shafer (1985), the real problem is that we are not conditioning
in the right space. If we work in a larger “sophisticated” space, where we take the
protocol used by Monty (in Example 1.1) and the jailer (in Example 1.2) into account, conditioning
does deliver the right answer. Roughly speaking, the sophisticated space consists
of  all  the  possible  sequences  of  events  that  could  happen  (for  example,  what  Monty  would 
say  in  each  circumstance,  or  what  the  jailer  would  say  in  each  circumstance),  with  their 
probability.2  However,  working  in  the  sophisticated  space  has  problems  too.  For  one  thing, 
it  is  not  always  clear  what  the  relevant  probabilities  in  the  sophisticated  space  are.  For 
example,  what  is  the  probability  that  the  jailer  says  b  if  b  and  c  are  to  be  executed?  Indeed, 
in  some  cases,  it  is  not  even  clear  what  the  elements  of  the  larger  space  are.  Moreover,  even 
when  the  elements  and  the  relevant  probabilities  are  known,  the  size  of  the  sophisticated 
space  may  become  an  issue,  as  the  following  example  shows. 

Example  1.3:  Suppose  that  a  world  describes  which  of  100  people  have  a  certain  disease. 
A world can be characterized by a tuple of 100 0s and 1s, where the ith component is 1
iff individual i has the disease. There are $2^{100}$ possible worlds. Further suppose that the
“agent” in question is a computer system. Initially, the agent has no information, and
considers all $2^{100}$ worlds equally likely. The agent then receives information that is assumed
to  be  true  about  which  world  is  the  actual  world.  This  information  comes  in  the  form  of 
statements like “individual i is sick or individual j is healthy” or “at least 7 people have the
disease”. Each such statement can be identified with a set of possible worlds. For example,
the statement “at least 7 people have the disease” can be identified with the set of tuples
with at least 7 1s. For simplicity, assume that the agent is given information saying “the
actual world is in set U”, for various sets U. Suppose at some point the agent has been

1. Both the Monty Hall puzzle and the two-children puzzle were discussed in Ask Marilyn, Marilyn vos Savant’s
weekly column in “Parade Magazine”. Of all Ask Marilyn columns ever published, they reportedly
(vos  Savant,  1994)  generated  respectively  the  most  and  the  second-most  response. 

2.  The  notions  of  “naive  space”  and  “sophisticated  space”  will  be  formalized  in  Section  2.  This  introduction 
is  meant  only  to  give  an  intuitive  feel  for  the  issues. 
told that the actual world is in $U_1, \ldots, U_n$. Then, after doing conditioning, the agent has a
uniform probability on $U_1 \cap \ldots \cap U_n$.

But  how  does  the  agent  keep  track  of  the  worlds  it  considers  possible?  It  certainly  will 
not  explicitly  list  them;  there  are  simply  too  many.  One  possibility  is  that  it  keeps  track 
of  what  it  has  been  told;  the  possible  worlds  are  then  the  ones  consistent  with  what  it  has 
been  told.  But  this  leads  to  two  obvious  problems:  checking  for  consistency  with  what  it 
has been told may be hard, and if it has been told n things for large n, remembering them
all  may  be  infeasible.  In  situations  where  these  two  problems  arise,  an  agent  may  not  be 
able  to  condition  appropriately.  | 
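
To make the scaling issue concrete, here is a minimal Python sketch of the explicit approach the example argues against: list every world, keep those consistent with what has been said, and spread the probability uniformly over them. The helper name `condition_uniform` and the two sample statements are hypothetical, and the toy instance uses 4 people rather than 100, since the enumeration visits all $2^n$ tuples.

```python
from fractions import Fraction
from itertools import product


def condition_uniform(n_people, statements):
    """Condition the uniform prior over all 0/1 tuples on every statement.

    Each statement is a predicate on a world (a tuple of 0s and 1s). The
    posterior is uniform on the worlds that satisfy all of the statements.
    """
    consistent = [w for w in product((0, 1), repeat=n_people)
                  if all(holds(w) for holds in statements)]
    return {w: Fraction(1, len(consistent)) for w in consistent}


# Toy instance: 4 people, so 2**4 = 16 worlds instead of 2**100.
at_least_two_sick = lambda w: sum(w) >= 2
p0_sick_or_p1_healthy = lambda w: w[0] == 1 or w[1] == 0
posterior = condition_uniform(4, [at_least_two_sick, p0_sick_or_p1_healthy])
```

The same loop with `n_people = 100` would have to visit $2^{100}$ tuples, which is the infeasibility the example points to.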

Example 1.3 provides some motivation for working in the smaller, more naive space. Examples
1.1 and 1.2 show that this is not always appropriate. Thus, an obvious question is
when  it  is  appropriate.  It  turns  out  that  this  question  is  highly  relevant  in  the  statistical 
areas  of  selectively  reported  data  and  missing  data.  Originally  studied  within  these  contexts 
(Rubin,  1976;  Dawid  &  Dickey,  1977),  it  was  later  found  that  it  also  plays  a  fundamental 
role in the statistical work on survival analysis (Kleinbaum, 1999). Building on previous approaches,
Heitjan and Rubin (1991) presented a necessary and sufficient condition for when
conditioning in the “naive space” is appropriate. Nowadays this so-called CAR (Coarsening
at Random) condition is an established tool in survival analysis. (For overviews, see Gill,
van der Laan, & Robins, 1997; Nielsen, 1998.) We examine this criterion in our own, rather
different  context,  and  show  that  it  applies  rather  rarely.  Specifically,  we  show  that  there  are 
realistic  settings  where  the  sample  space  is  structured  in  such  a  way  that  it  is  impossible  to 
satisfy  CAR,  and  we  provide  a  criterion  to  help  determine  whether  or  not  this  is  the  case. 
We  also  give  a  procedural  characterization  of  the  CAR  condition,  by  giving  a  randomized 
algorithm  that  generates  all  and  only  distributions  for  which  CAR  holds,  thereby  solving 
an  open  problem  posed  by  Gill  et  al.  (1997). 

We  then  show  that  the  situation  is  worse  if  the  information  does  not  come  in  the  form 
of an event. For that case, several generalizations of conditioning have been proposed. Perhaps
the best known are Jeffrey conditioning (Jeffrey, 1968) (also known as Jeffrey’s rule)
and Minimum Relative Entropy (MRE) Updating (Kullback, 1959; Csiszar, 1975; Shore &
Johnson, 1980) (also known as cross-entropy). Jeffrey conditioning is a generalization of
ordinary conditioning; MRE updating is a generalization of Jeffrey conditioning.

We show that Jeffrey conditioning, when applicable, can be justified under an appropriate
generalization of the CAR condition. Although it has been argued, using mostly
axiomatic characterizations, that MRE updating (and hence also Jeffrey conditioning) is,
when applicable, the only reasonable way to update probability (see, e.g., (Csiszar, 1991;
Shore & Johnson, 1980)), it is well known that there are situations where applying MRE
leads  to  paradoxical,  highly  counterintuitive  results  (Hunter,  1989;  Seidenfeld,  1986;  van 
Fraassen,  1981). 

Example  1.4:  Consider  the  Judy  Benjamin  problem  (van  Fraassen,  1981):  Judy  is  lost  in  a 
region  that  is  divided  into  two  halves,  Blue  and  Red  territory,  each  of  which  is  further  divided 
into  Headquarters  Company  area  and  Second  Company  area.  A  priori,  Judy  considers  it 
equally  likely  that  she  is  in  any  of  these  four  quadrants.  She  contacts  her  own  headquarters 
by  radio,  and  is  told  “I  can’t  be  sure  where  you  are.  If  you  are  in  Red  territory,  the  odds  are 
3:1 that you are in HQ Company area ...” At this point the radio gives out. MRE updating
on this information leads to a distribution where the posterior probability of being in Blue
territory is greater than 1/2. Indeed, if HQ had said “If you are in Red territory, the odds
are $\alpha : 1$ that you are in HQ company area ...”, then for all $\alpha \neq 1$, according to MRE
updating, the posterior probability of being in Blue territory is always greater than 1/2. |
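
The claim about the Blue posterior can be checked numerically. The sketch below is an illustration under stated assumptions, not the derivation used in the literature cited above: it takes the uniform prior on the four quadrants, imposes the constraint Pr(Red HQ) = alpha * Pr(Red Second), and uses the closed form obtained by setting the derivative of the relative entropy with respect to r = Pr(Red) to zero, which gives r = 1 / (1 + 2 exp(-h)), where h is the binary entropy (in nats) of alpha / (1 + alpha). Splitting the Blue mass evenly is assumed to be optimal by symmetry. All function names are hypothetical.

```python
import math


def mre_blue_probability(alpha):
    """Posterior Pr(Blue) under MRE updating of the uniform prior.

    The constraint is Pr(Red HQ) = alpha * Pr(Red Second), with alpha > 0.
    """
    p = alpha / (1.0 + alpha)  # Pr(HQ | Red) forced by the constraint
    h = -(p * math.log(p) + (1 - p) * math.log(1 - p))  # binary entropy, nats
    red = 1.0 / (1.0 + 2.0 * math.exp(-h))
    return 1.0 - red


def relative_entropy_to_uniform(red, alpha):
    """KL divergence from the uniform prior of the constrained distribution
    with Pr(Red) = red and the Blue mass split evenly."""
    q = [red * alpha / (1 + alpha), red / (1 + alpha),
         (1 - red) / 2, (1 - red) / 2]
    return sum(x * math.log(x / 0.25) for x in q if x > 0)
```

For alpha = 3 this gives a Blue posterior of roughly 0.533, and for alpha = 1 exactly 1/2. A coarse grid search over `relative_entropy_to_uniform` can be used as an independent check of the closed form.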

Grove  and  Halpern  (1997)  provide  a  “sophisticated  space”  where  conditioning  gives 
what  is  arguably  the  more  intuitive  answer  in  the  Judy  Benjamin  problem,  namely  that 
if HQ sends a message of the form “if you are in Red territory, then the odds are $\alpha : 1$
that you are in HQ company area” then Judy’s posterior probability of being in each of the
two  quadrants  in  Blue  remains  at  1/4.  Seidenfeld  (1986),  strengthening  results  of  Friedman 
and  Shimony  (1971),  showed  that  there  is  no  sophisticated  space  in  which  conditioning  will 
give  the  same  answer  as  MRE  in  this  case.  (See  also  (Dawid,  2001),  for  similar  results 
along  these  lines.)  We  strengthen  these  results  by  showing  that,  even  in  a  class  of  much 
simpler situations (where Jeffrey conditioning cannot be applied), using MRE in the naive
space  corresponds  to  conditioning  in  the  sophisticated  space  in  essentially  only  trivial  cases. 
These  results  taken  together  show  that  generally  speaking,  working  with  the  naive  space, 
while  an  attractive  approach,  is  likely  to  give  highly  misleading  answers.  That  is  the  main 
message  of  this  paper. 

We  remark  that,  although  there  are  certain  similarities,  our  results  are  quite  different 
in  spirit  from  the  well-known  results  of  Diaconis  and  Zabell  (1986).  They  considered  when 
a  posterior  probability  could  be  viewed  as  the  result  of  conditioning  a  prior  probability 
on  some  larger  space.  By  way  of  contrast,  we  have  a  fixed  larger  space  in  mind  (the 
“sophisticated space”), and are interested in when conditioning in the naive space and the
sophisticated  space  agree. 

It  is  also  worth  stressing  that  the  distinction  between  the  naive  and  the  sophisticated 
space  is  entirely  unrelated  to  the  philosophical  view  that  one  has  of  probability  and  how  one 
should  do  probabilistic  inference.  For  example,  the  probabilities  in  the  Monty  Hall  puzzle 
can  be  viewed  as  the  participant’s  subjective  probabilities  about  the  location  of  the  car  and 
about  what  Monty  will  say  under  what  circumstances;  alternatively,  they  can  be  viewed  as 
“frequentist”  probabilities,  inferred  from  watching  the  Monty  Hall  show  on  television  for 
many  weeks  and  then  setting  the  probabilities  equal  to  observed  frequencies.  The  problem 
we  address  occurs  both  from  a  frequentist  and  from  a  subjective  stance. 

The  rest  of  this  paper  is  organized  as  follows.  In  Section  2  we  formalize  the  notion  of 
naive  and  sophisticated  spaces.  In  Section  3,  we  consider  the  case  where  the  information 
comes  in  the  form  of  an  event.  We  describe  the  CAR  condition  and  show  that  it  is  violated 
in  a  general  setting  of  which  the  Monty  Hall  and  three-prisoners  puzzle  are  special  cases. 
In  Section  4  we  give  several  characterizations  of  CAR.  We  supply  conditions  under  which 
it  is  guaranteed  to  hold  and  guaranteed  not  to  hold,  and  we  give  a  randomized  algorithm 
that  generates  all  and  only  distributions  for  which  CAR  holds.  In  Section  5  we  consider 
the  case  where  the  information  is  not  in  the  form  of  an  event.  We  first  consider  situations 
where  Jeffrey  conditioning  can  be  applied.  We  show  that  Jeffrey  conditioning  in  the  naive 
space  gives  the  appropriate  answer  iff  a  generalized  CAR  condition  holds.  We  then  show 
that,  typically,  applying  MRE  in  the  naive  space  does  not  give  the  appropriate  answer.  We 
conclude  with  some  discussion  of  the  implication  of  these  results  in  Section  6. 
2. Naive vs. Sophisticated Spaces

Our  formal  model  is  a  special  case  of  the  multi-agent  systems  framework  (Halpern  &  Fagin, 
1989),  which  is  essentially  the  same  as  that  used  by  Friedman  and  Halpern  (1997)  to  model 
belief  revision.  We  assume  that  there  is  some  external  world  in  a  set  W,  and  an  agent  who 
makes  observations  or  gets  information  about  that  world.  We  can  describe  the  situation 
by a pair $(w, \ell)$, where $w \in W$ is the actual world, and $\ell$ is the agent’s local state, which
essentially characterizes her information. W is what we called the “naive space” in the
introduction. For the purposes of this paper, we assume that $\ell$ has the form $\langle o_1, \ldots, o_n \rangle$,
where $o_j$ is the observation that the agent makes at time $j$, $j = 1, \ldots, n$. This representation
implicitly  assumes  that  the  agent  remembers  everything  she  has  observed  (since  her  local 
state  encodes  all  the  previous  observations).  Thus,  we  ignore  memory  issues  here.  We 
also  ignore  computational  issues,  just  so  as  to  be  able  to  focus  on  when  conditioning  is 
appropriate. 

A pair $(w, \langle o_1, \ldots, o_n \rangle)$ is called a run. A run may be viewed as a complete description
of  what  happens  over  time  in  one  possible  execution  of  the  system.  For  simplicity,  in  this 
paper,  we  assume  that  the  state  of  the  world  does  not  change  over  time.  The  “sophisticated 
space”  is  the  set  of  all  possible  runs. 

In the Monty Hall puzzle, the naive space has three worlds, representing the three possible
locations of the car. The sophisticated space describes what Monty would have said
in  all  circumstances  (i.e.,  Monty’s  protocol)  as  well  as  where  the  car  is.  The  three-prisoners 
puzzle  is  treated  in  detail  in  Example  2.1  below.  While  in  these  cases  the  sophisticated 
space  is  still  relatively  simple,  this  is  no  longer  the  case  for  the  Judy  Benjamin  puzzle. 
Although the naive space has only four elements, constructing the sophisticated space involves
considering all the things that HQ could have said, which is far from clear, and the
conditions  under  which  HQ  says  any  particular  thing.  Grove  and  Halpern  (1997)  discuss 
the  difficulties  in  constructing  such  a  sophisticated  space. 

In  general,  not  only  is  it  not  clear  what  the  sophisticated  space  is,  but  the  need  for 
a  sophisticated  space  and  the  form  it  must  take  may  become  clear  only  after  the  fact. 
For  example,  in  the  Judy  Benjamin  problem,  before  contacting  headquarters,  Judy  would 
almost  certainly  not  have  had  a  sophisticated  space  in  mind  (even  assuming  she  was  an 
expert in probability), and could not have known the form it would have to take until after
hearing headquarters’ response.

In any case, if the agent has a prior probability on the set $\mathcal{R}$ of possible runs in the
sophisticated space, after hearing or observing $\langle o_1, \ldots, o_k \rangle$, she can condition, to get a
posterior on $\mathcal{R}$. Formally, the agent is conditioning her prior on the set of runs where her
local state at time $k$ is $\langle o_1, \ldots, o_k \rangle$.

Clearly the agent’s probability Pr on $\mathcal{R}$ induces a probability $\Pr_W$ on W by marginalization.
We are interested in whether the agent can compute her posterior on W after observing
$\langle o_1, \ldots, o_k \rangle$ in a relatively simple way, without having to work in the sophisticated space.

Example  2.1:  Consider  the  three-prisoners  puzzle  in  more  detail.  Here  the  naive  space  is 
$W = \{w_a, w_b, w_c\}$, where $w_x$ is the world where x is not executed. We are only interested
in runs of length 1, so n = 1. The set O of observations (what the agent can be told) is
$\{\{w_a, w_b\}, \{w_a, w_c\}\}$. Here “$\{w_a, w_b\}$” corresponds to the observation that either a or b will
not be executed (i.e., the jailer saying “c will be executed”); similarly, $\{w_a, w_c\}$ corresponds
to the jailer saying “b will be executed”. The sophisticated space consists of the four runs

$(w_a, \langle\{w_a, w_b\}\rangle)$, $(w_a, \langle\{w_a, w_c\}\rangle)$, $(w_b, \langle\{w_a, w_b\}\rangle)$, $(w_c, \langle\{w_a, w_c\}\rangle)$.

Note that there is no run with observation $\langle\{w_b, w_c\}\rangle$, since the jailer will not tell a that he
will  be  executed. 

According to the story, the prior $\Pr_W$ in the naive space has $\Pr_W(w) = 1/3$ for $w \in W$.
The  full  distribution  Pr  on  the  runs  is  not  completely  specified  by  the  story.  In  particular, 
we  are  not  told  the  probability  with  which  the  jailer  will  say  b  and  c  if  a  will  not  be  executed. 
We  return  to  this  point  in  Example  3.2.  | 

3.  The  CAR  Condition 

A  particularly  simple  setting  is  where  the  agent  observes  or  learns  that  the  external  world 
is in some set $U \subseteq W$. For simplicity, we assume throughout this paper that the agent
makes only one observation, and makes it at the first step of the run. Thus, the set O of
possible observations consists of nonempty subsets of W, and any run r can be written
as $r = (w, \langle U \rangle)$ where w is the actual world and U is a nonempty subset of W. However, O
does not necessarily consist of all the nonempty subsets of W. Some subsets may never be
observed. For example, in Example 2.1, a is never told that he will be executed, so $\{w_b, w_c\}$
is not observed. We assume that the agent’s observations are accurate, in that if the agent
observes U in a run r, then the actual world in r is in U. That is, we assume that all runs
are of the form $r = (w, \langle U \rangle)$ where $w \in U$. In Example 2.1, accuracy is enforced by the
requirement that runs have the form $(w_x, \langle\{w_x, w_y\}\rangle)$.

The observation or information obtained does not have to be exactly of the form “the
actual world is in U”. It suffices that it is equivalent to such a statement. This is the case
in both the Monty Hall puzzle and the three-prisoners puzzle. For example, in the three-prisoners
puzzle, being told that b will be executed is essentially equivalent to observing
$\{w_a, w_c\}$ (either a or c will not be executed).

In  this  setting,  we  can  ask  whether,  after  observing  U,  the  agent  can  compute  her 
posterior  on  W  by  conditioning  on  U.  Roughly  speaking,  this  amounts  to  asking  whether 
observing  U  is  the  same  as  discovering  that  U  is  true.  This  may  not  be  the  case  in  general — 
observing  or  being  told  U  may  carry  more  information  than  just  the  fact  that  U  is  true. 
For example, if for some reason a knows that the jailer would never say c if he could help
it (so that, in particular, if b and c will be executed, then he will definitely say b), then
hearing c (i.e., observing $\{w_a, w_b\}$) tells a much more than the fact that the true world is
one of $w_a$ or $w_b$. It says that the true world must be $w_b$ (for if the true world were $w_a$, the
jailer would have said b).

In  the  remainder  of  this  paper  we  assume  that  W  is  finite.  For  every  scenario  we  consider 
we define a set of possible observations O, consisting of nonempty subsets of W. For given
W and O, the set of runs $\mathcal{R}$ is then defined to be the set

$$\mathcal{R} = \{(w, \langle U \rangle) \mid U \in O,\ w \in U\}.$$

Given  our  assumptions  that  the  state  does  not  change  over  time  and  that  the  agent  makes 
only one observation, the set $\mathcal{R}$ of runs can be viewed as a subset of $W \times O$. While
just taking $\mathcal{R}$ to be a subset of $W \times O$ would slightly simplify the presentation here, in
general, we certainly want to allow sequences of observations. (Consider, for example, an
n-door version of the Monty Hall problem, where Monty opens a sequence of doors.) This
framework  extends  naturally  to  that  setting. 

Whenever we speak of a distribution Pr on $\mathcal{R}$, we implicitly assume that the probability
of any set on which we condition is strictly greater than 0. Let $X_W$ and $X_O$ be two random
variables on $\mathcal{R}$, where $X_W$ is the actual world and $X_O$ is the observed event. Thus, for
$r = (w, \langle U \rangle)$, $X_W(r) = w$ and $X_O(r) = U$. Given a distribution Pr on runs $\mathcal{R}$, we denote
by $\Pr_W$ the marginal distribution of $X_W$, and by $\Pr_O$ the marginal distribution of $X_O$.
For example, for $V, U \subseteq W$, $\Pr_W(V)$ is short for $\Pr(X_W \in V)$ and $\Pr_W(V \mid U)$ is short for
$\Pr(X_W \in V \mid X_W \in U)$.

Let Pr be a prior on $\mathcal{R}$ and let $\Pr' = \Pr(\cdot \mid X_O = U)$ be the posterior after observing U.
The main question we ask in this paper is under what conditions we have

$$\Pr'_W(V) = \Pr_W(V \mid U) \qquad (1)$$

for all $V \subseteq W$. That is, we want to know under what conditions the posterior on W induced by
$\Pr'$ can be computed from the prior on W by conditioning on the observation. (Example 3.2
below gives a concrete case.) We stress that Pr and $\Pr'$ are distributions on $\mathcal{R}$, while $\Pr_W$
and $\Pr'_W$ are distributions on W (obtained by marginalization from Pr and $\Pr'$, respectively).
Note that (1) is equivalently stated as

$$\Pr(X_W = w \mid X_O = U) = \Pr(X_W = w \mid X_W \in U) \text{ for all } w \in U. \qquad (2)$$

(1) (equivalently, (2)) is called the “CAR condition”. It can be characterized as follows³:

Theorem 3.1: (Gill et al., 1997) Fix a probability Pr on $\mathcal{R}$ and a set $U \subseteq W$. The
following are equivalent:

(a) If $\Pr(X_O = U) > 0$, then $\Pr(X_W = w \mid X_O = U) = \Pr(X_W = w \mid X_W \in U)$ for all
$w \in U$.

(b) The event $X_W = w$ is independent of the event $X_O = U$ given $X_W \in U$, for all $w \in U$.

(c) $\Pr(X_O = U \mid X_W = w) = \Pr(X_O = U \mid X_W \in U)$ for all $w \in U$ such that
$\Pr(X_W = w) > 0$.

(d) $\Pr(X_O = U \mid X_W = w) = \Pr(X_O = U \mid X_W = w')$ for all $w, w' \in U$ such that
$\Pr(X_W = w) > 0$ and $\Pr(X_W = w') > 0$.

For  completeness  (and  because  it  is  useful  for  our  later  Theorem  5.1),  we  provide  a  proof 
of  Theorem  3.1  in  the  appendix. 
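
The equivalence between conditions (a) and (d) can be explored on small examples. The following Python sketch is only an illustration under assumed conventions: a distribution on runs is represented as a dictionary mapping pairs `(w, U)` (with `U` a `frozenset` containing `w`) to exact fractions, runs with probability 0 are omitted, and every observation passed in is assumed to have positive probability. The function names are hypothetical.

```python
from fractions import Fraction


def world_marginal(pr):
    """Marginal distribution of X_W, computed from a distribution on runs."""
    marg = {}
    for (w, _), p in pr.items():
        marg[w] = marg.get(w, 0) + p
    return marg


def sophisticated_posterior(pr, U):
    """Pr(X_W = w | X_O = U) for w in U: the left-hand side of (2)."""
    total = sum(p for (_, V), p in pr.items() if V == U)
    return {w: pr.get((w, U), Fraction(0)) / total for w in U}


def naive_posterior(pr, U):
    """Pr(X_W = w | X_W in U) for w in U: the right-hand side of (2)."""
    marg = world_marginal(pr)
    total = sum(marg.get(w, 0) for w in U)
    return {w: marg.get(w, Fraction(0)) / total for w in U}


def coarsening_rates(pr, U):
    """Pr(X_O = U | X_W = w) for each w in U with positive prior (condition (d))."""
    marg = world_marginal(pr)
    return {w: pr.get((w, U), Fraction(0)) / marg[w]
            for w in U if marg.get(w, 0) > 0}
```

On a given example, the two posteriors agree for an observation U exactly when `coarsening_rates(pr, U)` takes a single value, which is what the theorem asserts.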

The  first  condition  in  Theorem  3.1  is  just  (2).  The  third  and  fourth  conditions  justify 
the name “coarsening at random”. Intuitively, first some world $w \in W$ is realized, and then

3.  Just  after  this  paper  was  accepted,  we  learned  that  there  really  exist  two  subtly  different  versions  of  the 
CAR  condition:  “weak”  CAR  and  “strong”  CAR.  What  is  called  CAR  in  this  paper  is  really  “weak” 
CAR.  Gill,  van  der  Laan,  and  Robins  (1997)  implicitly  use  the  “strong”  definition  of  CAR.  Indeed,  the 
statement  and  proof  of  Theorem  3.1  (about  weak  CAR)  are  very  slightly  different  from  the  corresponding 
statement  by  Gill  et  al.  (1997)  (about  strong  CAR);  the  difference  is  explained  in  Section  6. 
some “coarsening mechanism” decides which event $U \subseteq W$ such that $w \in U$ is revealed
to the agent. The event U is called a “coarsening” of w. The third and fourth conditions
effectively say that the probability that w is coarsened to U is the same for all $w \in U$. This
means that the “coarsening mechanism” is such that the probability of observing U is not
affected by the specific value of $w \in U$ that was realized.

In  the  remainder  of  this  paper,  when  we  say  “Pr  satisfies  CAR”,  we  mean  that  Pr 
satisfies  condition  (a)  of  Theorem  3.1  (or,  equivalently,  any  of  the  other  three  conditions) 
for all $U \in O$. Thus, “Pr satisfies CAR” means that conditioning in the naive space W
coincides with conditioning in the sophisticated space $\mathcal{R}$ with probability 1. The CAR
condition  explains  why  conditioning  in  the  naive  space  is  not  appropriate  in  the  Monty  Hall 
puzzle  or  the  three-prisoners  puzzle.  We  consider  the  three-prisoners  puzzle  in  detail;  a 
similar  analysis  applies  to  Monty  Hall. 

Example 3.2: In the three-prisoners puzzle, what is a's prior distribution Pr on $\mathcal{R}$? In
Example 2.1 we assumed that the marginal distribution $\Pr_W$ on W is uniform. Apart from
this, Pr is unspecified. Now suppose that a observes $\{w_a, w_c\}$ (“the jailer says b”). Naive
conditioning would lead a to adopt the distribution $\Pr_W(\cdot \mid \{w_a, w_c\})$. This distribution
satisfies $\Pr_W(w_a \mid \{w_a, w_c\}) = 1/2$. Sophisticated conditioning leads a to adopt the distribution
$\Pr' = \Pr(\cdot \mid X_O = \{w_a, w_c\})$. By part (d) of Theorem 3.1, naive conditioning is
appropriate (i.e., $\Pr'_W = \Pr_W(\cdot \mid \{w_a, w_c\})$) only if the jailer is equally likely to say b in
both worlds $w_a$ and $w_c$. Since the jailer must say that b will be executed in world $w_c$, it
follows that $\Pr(X_O = \{w_a, w_c\} \mid X_W = w_c) = 1$. Thus, conditioning is appropriate only if
the jailer’s protocol is such that he definitely says b in $w_a$, i.e., even if both b and c are
executed. But if this is the case, when the jailer says c, conditioning $\Pr_W$ on $\{w_a, w_b\}$ is not
appropriate, since then a knows that he will be executed. The world cannot be $w_a$, for then
the  jailer  would  have  said  b.  Therefore,  no  matter  what  the  jailer’s  protocol  is,  conditioning 
in  the  naive  space  cannot  coincide  with  conditioning  in  the  sophisticated  space  for  both  of 
his  responses.  | 
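
The dependence on the jailer's protocol can be made explicit with a short computation. In this Python sketch, the one unspecified quantity is treated as a parameter: `p_says_b_in_wa`, the probability that the jailer names b in the world where both b and c are to be executed. The function name, the parameter, and the string labels for worlds and responses are hypothetical choices made for illustration.

```python
from fractions import Fraction


def prisoner_a_survival(p_says_b_in_wa):
    """Posterior probability of w_a (a is not executed) after each jailer response.

    p_says_b_in_wa is the probability that the jailer names b in world w_a,
    the only world in which he has a choice; the puzzle leaves it unspecified.
    """
    p = Fraction(p_says_b_in_wa)
    third = Fraction(1, 3)  # uniform prior on w_a, w_b, w_c
    joint = {
        ("wa", "b"): third * p,
        ("wa", "c"): third * (1 - p),
        ("wc", "b"): third,  # in w_c, a and b are executed, so he must name b
        ("wb", "c"): third,  # in w_b, a and c are executed, so he must name c
    }
    result = {}
    for said in ("b", "c"):
        total = sum(pr for (_, s), pr in joint.items() if s == said)
        result[said] = joint[("wa", said)] / total
    return result
```

The posteriors are p / (1 + p) after hearing b and (1 - p) / (2 - p) after hearing c. The first equals the naive value 1/2 only when p = 1 and the second only when p = 0, so no single protocol makes naive conditioning correct for both responses.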

The  following  example  shows  that  in  general,  in  settings  of  the  type  arising  in  the  Monty 
Hall  and  the  three-prisoners  puzzle,  the  CAR  condition  can  only  be  satisfied  in  very  special 
cases: 

Example 3.3: Suppose that $O = \{U_1, U_2\}$, and both $U_1$ and $U_2$ are observed with positive
probability. (This is the case for both Monty Hall and the three-prisoners puzzle.) Then
the CAR condition (Theorem 3.1(c)) cannot hold for both $U_1$ and $U_2$ unless
$\Pr(X_W \in U_1 \cap U_2)$ is either 0 or 1. For suppose that $\Pr(X_O = U_1) > 0$, $\Pr(X_O = U_2) > 0$, and
$0 < \Pr(X_W \in U_1 \cap U_2) < 1$. Without loss of generality, there is some $w_1 \in U_1 - U_2$ and
$w_2 \in U_1 \cap U_2$ such that $\Pr(X_W = w_1) > 0$ and $\Pr(X_W = w_2) > 0$. Since observations are
accurate, we must have $\Pr(X_O = U_1 \mid X_W = w_1) = 1$. If CAR holds for $U_1$, then we must
have $\Pr(X_O = U_1 \mid X_W = w_2) = 1$. But then $\Pr(X_O = U_2 \mid X_W = w_2) = 0$. But since
$\Pr(X_O = U_2) > 0$, it follows that there is some $w_3 \in U_2$ such that $\Pr(X_W = w_3) > 0$ and
$\Pr(X_O = U_2 \mid X_W = w_3) > 0$. This contradicts the CAR condition. |

So  when  does  CAR  hold?  The  previous  example  exhibited  a  combination  of  O  and  W  for 
which  CAR  can  only  be  satisfied  in  “degenerate”  cases.  In  the  next  section,  we  shall  study 
this  question  for  arbitrary  combinations  of  O  and  W. 
4. Characterizing CAR

In  this  section,  we  provide  some  characterizations  of  when  the  CAR  condition  holds,  for 
finite  O  and  W.  Our  results  extend  earlier  results  of  Gill,  van  der  Laan,  and  Robins  (1997). 
We  first  exhibit  a  simple  situation  in  which  CAR  is  guaranteed  to  hold,  and  we  show  that 
this  is  the  only  situation  in  which  it  is  guaranteed  to  hold.  We  then  show  that,  for  arbitrary 
O and W, we can construct a 0-1-valued matrix from which a strong necessary condition
for  CAR  to  hold  can  be  obtained.  It  turns  out  that,  in  some  cases  of  interest,  CAR  is 
(roughly  speaking)  guaranteed  not  to  hold  except  in  “degenerate”  situations.  Finally,  we 
introduce  a  new  “procedural”  characterization  of  CAR:  we  provide  a  mechanism  such  that 
a  distribution  Pr  can  be  thought  of  as  arising  from  the  mechanism  if  and  only  if  Pr  satisfies 
CAR. 

4.1  When  CAR  is  Guaranteed  to  Hold 

We  first  consider  the  only  situation  where  CAR  is  guaranteed  to  hold:  if  the  sets  in  O  are 
pairwise  disjoint. 


Proposition 4.1: The CAR condition holds for all distributions Pr on $\mathcal{R}$ if and only if O
consists of pairwise disjoint subsets of W.


What happens if the sets in O are not pairwise disjoint? Are there still cases (combinations
of O, W, and distributions on $\mathcal{R}$) when CAR holds? There are, but they are quite
special.


4.2  When  CAR  May  Hold 


We  now  present  a  lemma  that  provides  a  new  characterization  of  CAR  in  terms  of  a  simple 
0/1-matrix. The lemma allows us to determine, for many combinations of O and W, whether
a distribution on $\mathcal{R}$ exists that satisfies CAR and gives certain worlds positive probability.

Fix a set $\mathcal{R}$ of runs, whose worlds are in some finite set $W$ and whose observations come
from some finite set $O = \{U_1, \ldots, U_n\}$. We say that $A \subseteq W$ is an $\mathcal{R}$-atom relative to $W$ and
$O$ if $A$ has the form $V_1 \cap \ldots \cap V_n$, where each $V_i$ is either $U_i$ or its complement $\overline{U_i}$, and
$\{r \in \mathcal{R} : X_W(r) \in A\} \neq \emptyset$. Let $\mathcal{A} = \{A_1, \ldots, A_m\}$ be the set of $\mathcal{R}$-atoms relative to $W$ and $O$.
We can think of $\mathcal{A}$ as a partition of the worlds according to what can be observed. Two worlds
$w_1$ and $w_2$ are in the same set $A_i \in \mathcal{A}$ if there are no observations that distinguish them; that is,
there is no observation $U \in O$ such that $w_1 \in U$ and $w_2 \notin U$. Define the $m \times n$ matrix $S$
with entries $s_{ij}$ as follows:

$$s_{ij} = \begin{cases} 1 & \text{if } A_i \subseteq U_j, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$


We  call  S  the  CARacterizing  matrix  (for  O  and  W).  Note  that  each  row  i  in  S  corresponds 
to  a  unique  atom  in  A;  we  call  this  the  atom  corresponding  to  row  i.  This  matrix  (actually, 
its  transpose)  was  first  introduced  (but  for  a  different  purpose)  by  Gill  et  al.  (1997). 

Example 4.2: Returning to Example 3.3, the CARacterizing matrix is given by

$$S = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \end{pmatrix},$$

where the columns correspond to $U_1$ and $U_2$ and the rows correspond to the three atoms
$U_1 - U_2$, $U_1 \cap U_2$, and $U_2 - U_1$. For example, the fact that entry $s_{31}$ of this matrix is 0
indicates that $U_1$ cannot be observed if the actual world $w$ is in $U_2 - U_1$. |
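
The construction of the CARacterizing matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `caracterizing_matrix` and the three world labels are hypothetical, and every listed world is assumed to occur in some run, so that each nonempty membership class counts as an atom.

```python
def caracterizing_matrix(worlds, observations):
    """Return (atoms, S) for finite worlds and a list of observation sets.

    Worlds with the same membership pattern across all observations form one
    atom; row i of S has a 1 in column j exactly when atom i lies inside U_j.
    Assumption: every listed world can occur in some run.
    """
    by_signature = {}
    for w in worlds:
        signature = tuple(1 if w in U else 0 for U in observations)
        by_signature.setdefault(signature, set()).add(w)
    # Dicts keep insertion order, so rows follow the order in which
    # atoms are first encountered while scanning the worlds.
    atoms = list(by_signature.values())
    S = [list(signature) for signature in by_signature]
    return atoms, S


# Hypothetical instance of Example 3.3: one world per atom,
# w1 in U1 - U2, w2 in the intersection, w3 in U2 - U1.
U1 = {"w1", "w2"}
U2 = {"w2", "w3"}
atoms, S = caracterizing_matrix(["w1", "w2", "w3"], [U1, U2])
for atom, row in zip(atoms, S):
    print(sorted(atom), row)
```

Grouping worlds by their membership signature is what makes the atoms a partition: two worlds share a row exactly when no observation distinguishes them.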

In the following lemma, $\gamma^T$ denotes the transpose of the (row) vector $\gamma$, and $\mathbf{1}$ denotes the
row vector consisting of all 1s.

Lemma 4.3: Let $\mathcal{R}$ be the set of runs over observations $O$ and worlds $W$, and let $S$ be the
CARacterizing matrix for $O$ and $W$.

(a) Let Pr be any distribution over $\mathcal{R}$ and let $S'$ be the matrix obtained by deleting from
$S$ all rows corresponding to an atom $A$ with $\Pr(X_W \in A) = 0$. Define the vector
$\gamma = (\gamma_1, \ldots, \gamma_n)$ by setting $\gamma_j = \Pr(X_O = U_j \mid X_W \in U_j)$ if $\Pr(X_W \in U_j) > 0$, and
$\gamma_j = 0$ otherwise, for $j = 1, \ldots, n$. If Pr satisfies CAR, then $S' \cdot \gamma^T = \mathbf{1}^T$.

(b) Let $S'$ be a matrix consisting of a subset of the rows of $S$, and let $\mathcal{P}_{W,S'}$ be the set of
distributions over $W$ with support corresponding to $S'$; i.e.,

$$\mathcal{P}_{W,S'} = \{ P_W \mid P_W(A) > 0 \text{ iff } A \text{ corresponds to a row in } S' \}.$$

If there exists a vector $\gamma \geq 0$ such that $S' \cdot \gamma^T = \mathbf{1}^T$, then, for all $P_W \in \mathcal{P}_{W,S'}$, there
exists a distribution Pr over $\mathcal{R}$ with $\Pr_W = P_W$ (i.e., the marginal of Pr on $W$ is
$P_W$) such that (a) Pr satisfies CAR and (b) $\Pr(X_O = U_j \mid X_W \in U_j) = \gamma_j$ for all $j$
with $\Pr(X_W \in U_j) > 0$.

Note that (b) is essentially a converse of (a). A natural question to ask is whether (b) would
still hold if we replaced “for all $P_W \in \mathcal{P}_{W,S'}$ there exists Pr satisfying CAR with $\Pr_W = P_W$”
by “for all distributions $P_O$ over $O$ there exists Pr satisfying CAR with $\Pr_O = P_O$.” The
answer is no; see Example 4.6(b)(ii).

Lemma 4.3 says that a distribution Pr that satisfies CAR and at the same time has
$\Pr(X_W \in A) > 0$ for $m$ different atoms $A$ can exist if and only if a certain set of $m$ linear
equations in $n$ unknowns has a solution. In many situations of interest, $m > n$ (note that $m$
may be as large as $2^n - 1$). Not surprisingly then, in such situations there often can be no
distribution Pr that satisfies CAR, as we show in the next subsection. On the other hand,
if the set of equations $S' \cdot \gamma^T = \mathbf{1}^T$ does have a solution in $\gamma$, then the set of all solutions forms
the intersection of an affine subspace (i.e. a hyperplane) of $\mathbb{R}^n$ and the positive orthant
$[0, \infty)^n$. These solutions are just the conditional probabilities $\Pr(X_O = U_j \mid X_W \in U_j)$ for all
distributions for which CAR holds that have support corresponding to $S'$. These conditional
probabilities may then be extended to a distribution over $\mathcal{R}$ by setting $\Pr_W = P_W$ for an
arbitrary distribution $P_W$ over the worlds in atoms corresponding to $S'$; all Pr constructed
in this way satisfy CAR.

Summarizing, we have the remarkable fact that for any given set of atoms $\mathcal{A}$ there are
only two possibilities: either no distribution exists which has $\Pr(X_W \in A) > 0$ for all $A \in \mathcal{A}$
and satisfies CAR, or for all distributions $P_W$ over worlds corresponding to atoms in $\mathcal{A}$,
there exists a distribution satisfying CAR with marginal distribution over worlds equal to
$P_W$.

4.3  When  CAR  is  Guaranteed  Not  to  Hold 

We  now  present  a  theorem  that  gives  two  explicit  and  easy-to-check  sufficient  conditions 
under  which  CAR  cannot  hold  unless  the  probabilities  of  some  atoms  and/or  observations 
are  0.  The  theorem  is  proved  by  showing  that  the  condition  of  Lemma  4.3(a)  cannot  hold 
under  the  stated  conditions. 

We briefly recall some standard definitions from linear algebra. A set of vectors $v_1, \ldots, v_m$
is called linearly dependent if there exist coefficients $\lambda_1, \ldots, \lambda_m$ (not all zero) such that
$\sum_{i=1}^{m} \lambda_i v_i = 0$; the vectors are affinely dependent if there exist coefficients $\lambda_1, \ldots, \lambda_m$ (not
all zero) such that $\sum_{i=1}^{m} \lambda_i v_i = 0$ and $\sum_{i=1}^{m} \lambda_i = 0$. A vector $u$ is called an affine combination
of $v_1, \ldots, v_m$ if there exist coefficients $\lambda_1, \ldots, \lambda_m$ such that $\sum_{i=1}^{m} \lambda_i v_i = u$ and $\sum_{i=1}^{m} \lambda_i = 0$.

Theorem 4.4: Let $\mathcal{R}$ be a set of runs over observations $O = \{U_1, \ldots, U_n\}$ and worlds $W$,
and let $S$ be the CARacterizing matrix for $O$ and $W$.

(a) Suppose that there exists a subset $R$ of the rows in $S$ and a vector $u = (u_1, \ldots, u_n)$
that is an affine combination of the rows of $R$ such that $u_j \geq 0$ for all $j \in \{1, \ldots, n\}$
and $u_{j^*} > 0$ for some $j^* \in \{1, \ldots, n\}$. Then there is no distribution Pr on $\mathcal{R}$ that
satisfies CAR such that $\Pr(X_O = U_{j^*}) > 0$ and $\Pr(X_W \in A) > 0$ for each $\mathcal{R}$-atom $A$
corresponding to a row in $R$.

(b) If there exists a subset $R$ of the rows of $S$ that is linearly dependent but not affinely
dependent, then there is no distribution Pr on $\mathcal{R}$ that satisfies CAR such that
$\Pr(X_W \in A) > 0$ for each $\mathcal{R}$-atom $A$ corresponding to a row in $R$.

(c) Given a set $R$ consisting of $n$ linearly independent rows of $S$ and a distribution $P_W$
on $W$ such that $P_W(A) > 0$ for all $A$ corresponding to a row in $R$, there is a unique
distribution $P_O$ on $O$ such that if Pr is a distribution on $\mathcal{R}$ satisfying CAR and
$\Pr(X_W \in A) = P_W(A)$ for each atom $A$ corresponding to a row in $R$, then
$\Pr(X_O = U) = P_O(U)$.

It  is  well  known  that  in  an  m  x  n  matrix,  at  most  n  rows  can  be  linearly  independent. 
In  many  cases  of  interest  (cf.  Example  4.5  below),  the  number  of  atoms  m  is  larger  than  the 
number  of  observations  n,  so  that  there  must  exist  subsets  R  of  rows  of  S  that  are  linearly 
dependent.  Thus,  part  (b)  of  Theorem  4.4  puts  nontrivial  constraints  on  the  distributions 
that  satisfy  CAR. 

The  requirement  in  part  (a)  may  seem  somewhat  obscure  but  it  can  be  easily  checked 
and applied in a number of situations, as illustrated in Examples 4.5 and 4.6 below. Part
(c) says that in many other cases of interest where neither part (a) nor (b) applies, even if
a distribution on $\mathcal{R}$ exists satisfying CAR, the probabilities of making the observations are
completely  determined  by  the  probability  of  various  events  in  the  world  occurring,  which 
seems  rather  unreasonable. 



Example 4.5: Consider the CARacterizing matrix of Example 4.2. Notice there exists an
affine combination of the first two rows that is not 0 and has no negative components:

$$-1 \cdot (1 \;\; 0) + 1 \cdot (1 \;\; 1) = (0 \;\; 1).$$

Similarly, there exists an affine combination of the last two rows that is not 0 and has no
negative components. It follows from Theorem 4.4(a) that there is no distribution satisfying
CAR that gives both of the observations $X_O = U_1$ and $X_O = U_2$ positive probability and
either (a) gives both $X_W \in U_1 - U_2$ and $X_W \in U_1 \cap U_2$ positive probability or (b) gives
both $X_W \in U_2 - U_1$ and $X_W \in U_1 \cap U_2$ positive probability. If both observations have
positive probability, then CAR can hold only if the probability of $U_1 \cap U_2$ is either 0 or 1.
(Example 3.3 already shows this using a more direct argument.) |
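
The simplest way the hypothesis of Theorem 4.4(a) can be met is through the difference of two rows, since the coefficients +1 and -1 sum to 0. The following Python sketch searches only for such pairwise differences, so it is a partial, sufficient check and not a full test of the theorem's condition. The function name `pairwise_obstructions` is hypothetical, and the matrix is written out from definition (3) for the atoms $U_1 - U_2$, $U_1 \cap U_2$, and $U_2 - U_1$.

```python
from itertools import permutations


def pairwise_obstructions(S):
    """Find row pairs (i, k) whose difference S[k] - S[i] is nonzero and >= 0.

    The difference uses coefficients +1 and -1, which sum to 0, so it is an
    affine combination in the sense used by Theorem 4.4(a). Each hit lists the
    columns j* with a positive entry: those observations cannot have positive
    probability under CAR if both atoms i and k have positive probability.
    """
    hits = []
    for i, k in permutations(range(len(S)), 2):
        u = [a - b for a, b in zip(S[k], S[i])]
        if all(x >= 0 for x in u) and any(x > 0 for x in u):
            hits.append((i, k, [j for j, x in enumerate(u) if x > 0]))
    return hits


# Rows: U1 - U2, U1 & U2, U2 - U1; columns: U1, U2 (indices are 0-based).
S = [[1, 0], [1, 1], [0, 1]]
obstructions = pairwise_obstructions(S)
print(obstructions)
```

Each reported triple names two atoms and the observations that are blocked when both atoms have positive probability, which is the pattern the example derives by hand.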

The  next  example  further  illustrates  that  in  general,  it  can  be  very  difficult  to  satisfy 
CAR. 

Example 4.6: Suppose that $O = \{U_1, U_2, U_3\}$, and all three observations can be made
with positive probability. It turns out that in this situation the CAR condition can hold,
but only if (a) $\Pr(X_W \in U_1 \cap U_2 \cap U_3) = 1$ (i.e., all of $U_1$, $U_2$, and $U_3$ must hold), (b)
$\Pr(X_W \in ((U_1 \cap U_2) - U_3) \cup ((U_2 \cap U_3) - U_1) \cup ((U_1 \cap U_3) - U_2)) = 1$ (i.e., exactly two of $U_1$,
$U_2$, and $U_3$ must hold), (c) $\Pr(X_W \in (U_1 - (U_2 \cup U_3)) \cup (U_2 - (U_1 \cup U_3)) \cup (U_3 - (U_2 \cup U_1))) = 1$
(i.e., exactly one of $U_1$, $U_2$, or $U_3$ must hold), or (d) one of $(U_1 - (U_2 \cup U_3)) \cup (U_2 \cap U_3)$,
$(U_2 - (U_1 \cup U_3)) \cup (U_1 \cap U_3)$, or $(U_3 - (U_1 \cup U_2)) \cup (U_1 \cap U_2)$ has probability 1 (either exactly
one of $U_1$, $U_2$, or $U_3$ holds, or the remaining two both hold).

We first check that CAR can hold in all these cases. It should be clear that CAR
can hold in case (a). Moreover, there are no constraints on $\Pr(X_O = U_i \mid X_W = w)$ for
$w \in U_1 \cap U_2 \cap U_3$ (except, by the CAR condition, for each fixed $i$, the probability must be
the same for all $w \in U_1 \cap U_2 \cap U_3$, and the three probabilities must sum to 1).

For case (b), let $A_i$ be the atom where exactly two of $U_1$, $U_2$, and $U_3$ hold, and $U_i$ does
not hold, for $i = 1, 2, 3$. Suppose that $\Pr(X_W \in A_1 \cup A_2 \cup A_3) = 1$. Note that, since all
three observations can be made with positive probability, at least two of $A_1$, $A_2$, and $A_3$
must have positive probability. Hence we can distinguish between two subcases: (i) only
two of them have positive probability, and (ii) all three have positive probability.

For subcase (i), suppose without loss of generality that only $A_1$ and $A_2$ have positive
probability. Then it immediately follows from the CAR condition that there must be some
$\alpha$ with $0 < \alpha < 1$ such that $\Pr(X_O = U_3 \mid X_W = w) = \alpha$, for all $w \in A_1 \cup A_2$ such
that $\Pr(X_W = w) > 0$. Thus, $\Pr(X_O = U_1 \mid X_W = w) = 1 - \alpha$ for all $w \in A_2$ such
that $\Pr(X_W = w) > 0$, and $\Pr(X_O = U_2 \mid X_W = w) = 1 - \alpha$ for all $w \in A_1$ such that
$\Pr(X_W = w) > 0$.

Subcase (ii) is more interesting. The rows of the CARacterizing matrix $S$ corresponding
to $A_1$, $A_2$, and $A_3$ are (0 1 1), (1 0 1), and (1 1 0), respectively. Now Lemma 4.3(a) tells
us that if Pr satisfies CAR, then we must have $S \cdot \gamma^T = \mathbf{1}^T$ for some $\gamma = (\gamma_1, \gamma_2, \gamma_3)$ with
$\gamma_i = \Pr(X_O = U_i \mid X_W \in U_i)$. These three linear equations have solution

$$\gamma_1 = \gamma_2 = \gamma_3 = \frac{1}{2}.$$

Since this solution is unique, it follows by Lemma 4.3(b) that all distributions that satisfy
CAR must have conditional probabilities $\Pr(X_O = U_i \mid X_W \in U_i) = 1/2$, and that their
marginal distributions on $W$ can be arbitrary. This fully characterizes the set of distributions
Pr for which CAR holds in this case. Note that for $i = 1, 2, 3$, since we can write
$\gamma_i = \Pr(X_O = U_i) / \Pr(X_W \in U_i)$, we have $\Pr(X_O = U_i) = \Pr(X_W \in U_i)\,\gamma_i \leq 1/2$, so that, in
contrast to the marginal distribution over $W$, the marginal distribution over $O$ cannot be
chosen arbitrarily.

In case (c), it should also be clear that CAR can hold. Moreover, $\Pr(X_O = U_i \mid X_W = w)$
is either 0 or 1, depending on whether $w \in U_i$. Finally, for case (d), suppose that
$\Pr(X_W \in U_1 \cup (U_2 \cap U_3)) = 1$. CAR holds iff there exists $\alpha$ such that $\Pr(X_O = U_2 \mid X_W = w) = \alpha$
and $\Pr(X_O = U_3 \mid X_W = w) = 1 - \alpha$ for all $w \in U_2 \cap U_3$ such that $\Pr(X_W = w) > 0$. (Of
course, $\Pr(X_O = U_1 \mid X_W = w) = 1$ for all $w \in U_1$ such that $\Pr(X_W = w) > 0$.)

Now we show that CAR cannot hold in any other case. First suppose that
$0 < \Pr(X_W \in U_1 \cap U_2 \cap U_3) < 1$. Thus, there must be at least one other atom $A$ such that $\Pr(X_W \in A) > 0$.
The row corresponding to the atom $U_1 \cap U_2 \cap U_3$ is (1 1 1). Suppose $r$ is the row corresponding
to the other atom $A$. Since $S$ is a 0-1 matrix, the vector $(1\ 1\ 1) - r$ is an affine
combination of (1 1 1) and $r$ that is nonzero and has nonnegative components. It now
follows by Theorem 4.4 that CAR cannot hold in this case.

Similar arguments give a contradiction in all the other cases; we leave details to the
reader. |
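
Subcase (ii) of Example 4.6 can be checked with exact arithmetic. The sketch below solves the three linear equations by Gauss-Jordan elimination over rationals and then computes the induced marginal on observations. The helper `solve_square` and the particular marginal on atoms are illustrative assumptions, not part of the original text.

```python
from fractions import Fraction


def solve_square(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, invertible)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick any row at or below the diagonal with a nonzero pivot entry.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [row[n] for row in M]


# Rows of S for the atoms A1, A2, A3 (columns U1, U2, U3).
S = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
gamma = solve_square(S, [1, 1, 1])

# Hypothetical marginal on the atoms; any strictly positive choice would do.
P_atoms = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
# Pr(X_O = U_j) = Pr(X_W in U_j) * gamma_j, where U_j contains atom i iff S[i][j] == 1.
P_obs = [sum(P_atoms[i] * S[i][j] for i in range(3)) * gamma[j] for j in range(3)]
print(gamma, P_obs)
```

Changing `P_atoms` changes `P_obs`, but each entry stays at or below 1/2, which illustrates why the marginal over observations cannot be chosen freely in this subcase.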

4.4  Discussion:  “CAR  is  Everything”  vs.  “Sometimes  CAR  is  Nothing” 

In one of their main theorems, Gill, van der Laan, and Robins (1997, Section 2) show that
the CAR assumption is untestable from observations of $X_O$ alone, in the sense that the
assumption “Pr satisfies CAR” imposes no restrictions at all on the marginal distribution
$\Pr_O$ on $X_O$. More precisely, they show that for every finite set $W$ of worlds, every set
$O$ of observations, and every distribution $P_O$ on $O$, there is a distribution $\Pr^*$ on $\mathcal{R}$ such
that $\Pr^*_O$ (the marginal of $\Pr^*$ on $O$) is equal to $P_O$ and $\Pr^*$ satisfies CAR. The authors
summarize this as “CAR is everything”.

We must be careful in interpreting this result. Theorem 4.4 shows that, for many
combinations of $O$ and $W$, CAR can hold only for distributions Pr with $\Pr(X_W \in A) = 0$
for some atoms $A$. (In the previous sections, we called such distributions “degenerate”.)
In our view, this says that in some cases, CAR effectively cannot hold. To see why, first
suppose we are given a set $W$ of worlds and a set $O$ of observations. Now we may feel
confident a priori that some $U_0 \in O$ and some $w_0 \in W$ cannot occur in practice. In this
case, we are willing to consider only distributions Pr on $O \times W$ that have $\Pr(X_O = U_0) = 0$,
$\Pr(X_W = w_0) = 0$. (For example, $W$ may be a product space $W = W_a \times W_b$ and it
is known that some combination $w_a \in W_a$ and $w_b \in W_b$ can never occur together; then
$\Pr(X_W = (w_a, w_b)) = 0$.) Define $O^*$ to be the subset of $O$ consisting of all $U$ that we cannot
a priori rule out; similarly, $W^*$ is the subset of $W$ consisting of all $w$ that we cannot a priori
rule out. By Theorem 4.4, it is still possible that $O^*$ and $W^*$ are such that, even if we restrict
to runs where only observations in $O^*$ are made, CAR can only hold if $\Pr(X_W \in A) = 0$
for some atoms (nonempty subsets) $A \subseteq W^*$. This means that CAR may force us to assign
probability 0 to some events that, a priori, were considered possible. Examples 3.3 and 4.6
illustrate this phenomenon. We may summarize this as “sometimes CAR is nothing”.

Given that CAR imposes such strong conditions, the reader may wonder why
there is so much study of the CAR condition in the statistics literature. The reason is
that some of the special situations in which CAR holds often arise in missing data and
survival analysis problems. Here is an example: Suppose that the set of observations can be
written as $O = \bigcup_{i=1}^{k} \Pi_i$, where each $\Pi_i$ is a partition of $W$ (that is, a set of pairwise disjoint
subsets of $W$ whose union is $W$). Further suppose that observations are generated by the
following process, which we call CARgen. Some $i$ between 1 and $k$ is chosen according
to some arbitrary distribution $P_0$; independently, $w \in W$ is chosen according to $P_W$. The
agent then observes the unique $U \in \Pi_i$ such that $w \in U$. Intuitively, the partitions $\Pi_i$ may
represent the observations that can be made with a particular sensor. Thus, $P_0$ determines
the probability that a particular sensor is chosen; $P_W$ determines the probability that a
particular world is chosen. The sensor and the world together determine the observation
that is made. It is easy to see that this mechanism induces a distribution on $\mathcal{R}$ for which
CAR holds.
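
The claim that CARgen induces a CAR distribution can be verified exactly on a small instance. The sketch below computes the joint distribution over (world, observation) pairs and tests the CAR condition in the form of Theorem 3.1(c). The function names, the four-world space, and the two partitions are hypothetical choices made only for illustration.

```python
from fractions import Fraction


def cargen_joint(P_W, partitions, P_sensor):
    """Exact joint distribution over (world, observation) induced by CARgen.

    P_W: dict world -> probability. partitions: list of partitions of the
    worlds, each a list of frozensets. P_sensor: probability of choosing each
    partition; the sensor index is drawn independently of the world.
    """
    joint = {}
    for w, pw in P_W.items():
        for cells, ps in zip(partitions, P_sensor):
            U = next(cell for cell in cells if w in cell)  # unique cell containing w
            joint[(w, U)] = joint.get((w, U), 0) + pw * ps
    return joint


def satisfies_car(joint, P_W):
    """Check that Pr(X_O = U | X_W = w) agrees for all w in U with P_W(w) > 0."""
    observations = {U for (_, U) in joint}
    for U in observations:
        conds = {joint.get((w, U), 0) / P_W[w] for w in U if P_W[w] > 0}
        if len(conds) > 1:
            return False
    return True


# Hypothetical setup: four worlds, two sensors (partitions).
P_W = {w: Fraction(w, 10) for w in (1, 2, 3, 4)}
partitions = [
    [frozenset({1, 2}), frozenset({3, 4})],
    [frozenset({1}), frozenset({2, 3, 4})],
]
P_sensor = [Fraction(1, 3), Fraction(2, 3)]
joint = cargen_joint(P_W, partitions, P_sensor)
print(satisfies_car(joint, P_W))
```

The check succeeds for any choice of `P_W` and `P_sensor` here because the probability of observing a cell depends only on which sensors contain that cell, not on the particular world inside it.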

The special case with $O = \Pi_1 \cup \Pi_2$, $\Pi_1 = \{W\}$, and $\Pi_2 = \{\{w\} \mid w \in W\}$ corresponds to a
simple missing data problem (Example 4.7 below). Intuitively, either complete information
is given, or there is no data at all. In this context, CAR is often called MAR: missing
at random. In more realistic MAR problems, we may observe a vector with some of its
components missing. In such cases the CAR condition sometimes still holds. In practical
missing data problems, the goal is often to infer the distribution Pr on runs $\mathcal{R}$ from successive
observations of $X_O$. That is, one observes a sample $U_{(1)}, U_{(2)}, \ldots, U_{(n)}$, where $U_{(i)} \in O$.
Typically, the $U_{(i)}$ are assumed to be an i.i.d. (independent and identically distributed) sample
of outcomes of $X_O$. The corresponding “worlds” $w_1, w_2, \ldots$ (outcomes of $X_W$) are not
observed. Depending on the situation, Pr may be completely unknown or is assumed to be
a member of some parametric family of distributions. If the number of observations $n$ is
large, then clearly the sample $U_{(1)}, U_{(2)}, \ldots, U_{(n)}$ can be used to obtain a reasonable estimate
of $\Pr_O$, the marginal distribution on $X_O$. But one is interested in the full distribution Pr.
That distribution usually cannot be inferred without making additional assumptions, such
as the CAR assumption.

Example 4.7: (adapted from (Scharfstein, Daniels, & Robins, 2002)) Suppose that a
medical study is conducted to test the effect of a new drug. The drug is administered
to a group of patients on a weekly basis. Before the experiment is started and after it is
finished, some characteristic (say, the blood pressure) of the patients is measured. The data
are thus differences in blood pressure for individual patients before and after the treatment.
In practical studies of this kind, often several of the patients drop out of the experiment.
For such patients there is then no data. We model this as follows: $W$ is the set of possible
values of the characteristic we are interested in (e.g., blood pressure difference). $O = \Pi_1 \cup \Pi_2$
with $\Pi_1 = \{W\}$, and $\Pi_2 = \{\{w\} \mid w \in W\}$ as above. For “compliers” (patients that did
not drop out), we observe $X_O = \{w\}$, where $w$ is the value of the characteristic we want
to measure. For dropouts, we observe $X_O = W$ (that is, we observe nothing at all). We
thus have, for example, a sequence of observations $U_1 = \{w_1\}$, $U_2 = \{w_2\}$, $U_3 = W$,
$U_4 = \{w_4\}$, $U_5 = W$, $\ldots$, $U_n = \{w_n\}$. If this sample is large enough, we can use it to obtain a
reasonable estimate of the probability that a patient drops out (the ratio of outcomes with
$U_i = W$ to the total number of outcomes). We can also get a reasonable estimate of the
distribution of $X_W$ for the complying patients. Together these two distributions determine
the distribution of $X_O$.

We are interested in the effect of the drug in the general population. Unfortunately, it
may be the case that the effect on dropouts is different from the effect on compliers.
(Scharfstein, Daniels, and Robins (2002) discuss an actual medical study in which physicians judged
the effect on dropouts to be very different from the effect on compliers.) Then we cannot
infer the distribution on $W$ from the observations $U_1, U_2, \ldots$ alone without making additional
assumptions about how the distribution for dropouts is related to the distribution for
compliers. Perhaps the simplest such assumption that one can make is that the distribution
of $X_W$ for dropouts is in fact the same as the distribution of $X_W$ for compliers: the data
are “missing at random”. Of course, this assumption is just the CAR assumption. By
Theorem 3.1(a), CAR holds iff for all $w \in W$

$$\Pr(X_W = w \mid X_O = W) = \Pr(X_W = w \mid X_W \in W) = \Pr(X_W = w),$$

which means just that the distribution of $W$ is independent of whether a patient drops out
($X_O = W$) or not. Thus, if CAR can be assumed, then we can infer the distribution on $W$
(which is what we are really interested in). |
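
The estimation step described in Example 4.7 can be sketched as follows. The code encodes a complier's observation $\{w\}$ by the value `w` and a dropout's observation $W$ by `None`; this encoding, the function name, and the sample values are assumptions made for illustration. The second estimate is only meaningful if the missing-at-random (CAR) assumption is accepted.

```python
from collections import Counter
from fractions import Fraction


def estimate_under_mar(observations):
    """Estimate the dropout rate and the distribution of X_W from coarsened data.

    observations: list in which a complier contributes the measured value w
    (standing for the observation {w}) and a dropout contributes None
    (standing for the observation W). Under MAR, the empirical distribution
    among compliers is used as the estimate for the whole population.
    """
    n = len(observations)
    observed = [w for w in observations if w is not None]
    dropout_rate = Fraction(n - len(observed), n)
    counts = Counter(observed)
    world_dist = {w: Fraction(c, len(observed)) for w, c in counts.items()}
    return dropout_rate, world_dist


# Hypothetical coarsened sample of blood-pressure differences.
sample = [-10, -5, None, -10, None, 0, -5, -10]
rate, dist = estimate_under_mar(sample)
print(rate, dist)
```

Without the MAR assumption, `dist` describes compliers only, and nothing in the sample constrains the distribution for dropouts.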

Many problems in missing data and survival analysis are of the kind illustrated above: The
analysis would be greatly simplified if CAR holds, but whether or not this is so is not clear.
It is therefore of obvious interest to investigate whether, from observing the “coarsened”
data $U_{(1)}, U_{(2)}, \ldots, U_{(n)}$ alone, it may already be possible to test the assumption that CAR
holds. For example, one might imagine that there are distributions on $X_O$ for which CAR
simply cannot hold. If the empirical distribution of the $U_{(i)}$ were “close” (in the appropriate
sense) to a distribution that rules out CAR, the statistician might infer that Pr does not
satisfy CAR. Unfortunately, if $O$ is finite, then the result of Gill, van der Laan, and Robins
(1997, Section 2) referred to at the beginning of this section shows that we can never rule
out CAR in this way.

We are interested in the question of whether CAR can hold in a “nondegenerate” sense,
given $O$ and $W$. From this point of view, the slogan “sometimes CAR is nothing” makes
sense. In contrast, Gill et al. (1997) were interested in the question of whether CAR can
be tested from observations of $X_O$ alone. From that point of view, the slogan “CAR is
everything” makes perfect sense. In fact, Gill, van der Laan, and Robins were quite aware,
and explicitly stated, that CAR imposes very strong assumptions on the distribution Pr. In
a later paper, it was even implicitly stated that in some cases CAR forces $\Pr(X_W \in A) = 0$
for some atoms $A$ (Robins, Rotnitzky, & Scharfstein, 1999, Section 9.1). Our contribution is
to provide the precise conditions (Lemma 4.3 and Theorem 4.4) under which this happens.

Robins, Rotnitzky, and Scharfstein (1999) also introduced a Bayesian method (later
extended by Scharfstein et al. (2002)) that allows one to specify a prior distribution
over a parameter $\alpha$ which indicates, in a precise sense, how much Pr deviates from CAR.
For example, $\alpha = 0$ corresponds to the set of distributions Pr satisfying CAR. The precise
connection between this work and ours needs further investigation.

4.5 A Mechanism for Generating Distributions Satisfying CAR

In  Theorem  3.1  and  Lemma  4.3  we  described  CAR  in  an  algebraic  way,  as  a  collection  of 
probabilities  satisfying  certain  equalities.  Is  there  a  more  “procedural”  way  of  representing 
CAR?  In  particular,  does  there  exist  a  single  mechanism  that  gives  rise  to  CAR  such 
that  every  case  of  CAR  can  be  viewed  as  a  special  case  of  this  mechanism?  We  have 
already  encountered  a  possible  candidate  for  such  a  mechanism:  the  CARgen  procedure. 
In  Section  4.4  we  described  this  mechanism  and  indicated  that  it  generates  only  distributions 
that  satisfy  CAR.  Unfortunately,  as  we  now  show,  there  exist  CAR  distributions  that  cannot 
be  interpreted  as  being  generated  by  CARgen. 

Our  example  is  based  on  an  example  given  by  Gill,  van  der  Laan,  and  Robins  (1997),  who 
were  actually  the  first  to  consider  whether  there  exist  ‘natural’  mechanisms  that  generate 
all  and  only  distributions  satisfying  CAR.  They  show  that  in  several  problems  of  survival 
analysis,  observations  are  generated  according  to  what  they  call  a  randomized  monotone 
coarsening scheme. They also show that their randomized scheme generates only distributions
that satisfy CAR. In fact, the randomized monotone coarsening scheme turns out to
be  a  special  case  of  CARgen,  although  we  do  not  prove  this  here.  Gill,  van  der  Laan,  and 
Robins  show  by  example  that  the  randomized  coarsening  schemes  do  not  suffice  to  generate 
all  CAR  distributions.  Essentially  the  same  example  shows  that  CARgen  does  not  either. 

Example 4.8: Consider subcase (ii) of Example 4.6 again. Let $U_1$, $U_2$, $U_3$ and $A_1$, $A_2$, and
$A_3$ be as in that example, and assume for simplicity that $W = A_1 \cup A_2 \cup A_3$. The example
showed that there exist distributions Pr satisfying CAR in this case with $\Pr(A_i) > 0$ for
$i \in \{1, 2, 3\}$, all having conditional probabilities $\Pr(X_O = U_i \mid X_W = w) = 1/2$ for all
$w \in U_i$. Clearly, $U_1$, $U_2$, and $U_3$ cannot be grouped together to form a set of partitions of
$W$. So, even though CAR holds for Pr, CARgen cannot be used to simulate Pr. |

While Gill, van der Laan, and Robins (1997, Section 3) ask whether there exists a
general mechanism for generating all and only CAR distributions, they do not make this
question mathematically precise. As noted by Gill, van der Laan, and Robins, the problem
here is that without any constraint on what constitutes a valid mechanism, there is clearly
a trivial solution to the problem: Given a distribution Pr satisfying CAR, we simply draw a
world $w$ according to $\Pr_W$, and then draw $U$ such that $w \in U$ according to the distribution
$\Pr(X_O = U \mid X_W = w)$. This is obviously cheating in some sense. Intuitively, it seems that
a ‘reasonable’ mechanism should not be allowed to choose $U$ according to a distribution
depending on $w$. It does not have that kind of control over the observations that are made.

So  what  counts  as  a  “reasonable”  mechanism?  Intuitively,  the  mechanism  should  be 
able  to  control  only  what  can  be  controlled  in  an  experimental  setup.  We  can  think  of  the 
mechanism  as  an  “agent”  that  uses  a  set  of  sensors  to  obtain  information  about  the  world. 
The agent does not have control over $P_W$. While the agent may certainly choose which
sensor  to  use,  it  is  not  reasonable  to  assume  that  she  can  control  their  output  (or  exactly 
what  they  can  sense).  Indeed,  given  a  world  w,  the  observation  returned  by  the  sensor  is 
fully  determined.  This  is  exactly  the  setup  implemented  in  the  CARgen  scheme  discussed 
in  Section  4.4,  which  we  therefore  regard  as  a  legitimate  mechanism.  We  now  introduce 
a  procedure  CARgen*,  which  extends  CARgen,  and  turns  out  to  generate  all  and  only 
distributions satisfying CAR. Just like CARgen, CARgen* assumes that there is a collection
of sensors, and it consults a given sensor with a certain predetermined probability.
However, unlike CARgen, CARgen* may ignore a sensor reading. The decision whether
or  not  a  sensor  reading  is  ignored  is  allowed  to  depend  only  on  the  sensor  that  has  been 
chosen  by  the  agent,  and  on  the  observation  generated  by  that  sensor.  It  is  not  allowed  to 
depend  on  the  actual  world,  since  the  agent  may  not  know  what  the  actual  world  is.  Once 
again,  the  procedure  is  “reasonable”  in  that  the  agent  is  only  allowed  to  control  what  can 
be  controlled  in  an  experimental  setup. 

We  now  present  CARgen*,  then  argue  that  it  is  “reasonable”  in  the  sense  above,  and 
that  it  generates  all  and  only  CAR  distributions. 

Procedure  CARgen* 

1.  Preparation: 

• Fix an arbitrary distribution $P_W$ on $W$.

• Fix a set $\mathcal{P}$ of partitions of $W$, and fix an arbitrary distribution $P_{\mathcal{P}}$ on $\mathcal{P}$.

• Choose numbers $q \in [0,1)$ and $q_{U|\Pi} \in [0,1]$ for each pair $(U,\Pi)$ such that $\Pi \in \mathcal{P}$ and $U \in \Pi$, satisfying the following constraint for each $w \in W$ such that $P_W(w) > 0$:

$$q = \sum_{\{(U,\Pi):\, w \in U,\, U \in \Pi\}} P_{\mathcal{P}}(\Pi)\, q_{U|\Pi}. \qquad (4)$$

2. Generation:

2.1 Choose $w \in W$ according to $P_W$.

2.2 Choose $\Pi \in \mathcal{P}$ according to $P_{\mathcal{P}}$. Let $U$ be the unique set in $\Pi$ such that $w \in U$.

2.3 With probability $1 - q_{U|\Pi}$, return $(w,U)$ and halt. With probability $q_{U|\Pi}$, go to step 2.2.

It is easy to see that CARgen is the special case of CARgen* where $q_{U|\Pi} = 0$ for all $(U,\Pi)$. Allowing $q_{U|\Pi} > 0$ gives us a little more flexibility. To understand the role of the constraint (4), note that $q_{U|\Pi}$ is the probability that the algorithm does not terminate at step 2.3, given that $U$ and $\Pi$ are chosen at step 2.2. It follows that the probability $q_w$ that a pair $(w,U)$ is not output at step 2.3 for some $U$ is

$$q_w = \sum_{\{(U,\Pi):\, w \in U,\, U \in \Pi\}} P_{\mathcal{P}}(\Pi)\, q_{U|\Pi}.$$

Thus, (4) says that the probability $q_w$ that a pair whose first component is $w$ is not output at step 2.3 is the same for all $w \in W$.

CARgen* can generate the CAR distribution in Example 4.8, which could not be generated by CARgen. To see this, using the same notation as in the example, consider the set of partitions $\mathcal{P} = \{\Pi_1, \Pi_2, \Pi_3\}$ with $\Pi_i = \{U_i, A_i\}$. Let $P_{\mathcal{P}}(\Pi_1) = P_{\mathcal{P}}(\Pi_2) = P_{\mathcal{P}}(\Pi_3) = 1/3$, $q_{U_i|\Pi_i} = 0$, and $q_{A_i|\Pi_i} = 1$. It is easy to verify that for all $w \in W$, we have that $\sum_{\{(U,\Pi):\, w \in U,\, U \in \Pi\}} P_{\mathcal{P}}(\Pi)\, q_{U|\Pi} = 1/3$, so that the constraint (4) is satisfied. Moreover, direct calculation shows that, for arbitrary $P_W$, the distribution $\Pr^*$ on runs generated by CARgen* with this choice of parameters is precisely the unique distribution satisfying CAR in this case.
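
As an illustrative sketch (not part of the original procedure), the following Python code computes the exact distribution over pairs $(w,U)$ returned by CARgen*. Since each pass through steps 2.2 and 2.3 is independent, the probability of eventually returning $(w,U)$ is the single-round acceptance probability divided by $1 - q_w$. The function names, the data layout, and the three-world instance are assumptions made for illustration; the instance is in the spirit of the parameter choice above but is not claimed to be the one in Example 4.8.

```python
from fractions import Fraction

def block_containing(w, partition):
    """Return the unique block U of the partition with w in U."""
    return next(U for U in partition if w in U)

def cargen_star_distribution(P_W, partitions, P_P, q_reject):
    """Exact probability that CARgen* returns each pair (w, U).

    P_W:        dict world -> probability
    partitions: dict name -> list of frozensets (one partition of W each)
    P_P:        dict name -> probability of choosing that partition
    q_reject:   dict (name, U) -> probability of ignoring reading U (q_{U|Pi})
    """
    out = {}
    for w, p_w in P_W.items():
        if p_w == 0:
            continue
        accept = {}            # single-round probability of returning (w, U)
        q_w = Fraction(0)      # single-round probability of looping to step 2.2
        for name, partition in partitions.items():
            U = block_containing(w, partition)
            q = q_reject[(name, U)]
            q_w += P_P[name] * q
            accept[U] = accept.get(U, Fraction(0)) + P_P[name] * (1 - q)
        for U, a in accept.items():
            if a > 0:
                out[(w, U)] = p_w * a / (1 - q_w)  # geometric series over rounds
    return out

# Hypothetical instance: three worlds, three overlapping observations.
W = frozenset("abc")
U_sets = [frozenset("ab"), frozenset("bc"), frozenset("ac")]
partitions = {i: [U_sets[i], W - U_sets[i]] for i in range(3)}
P_P = {i: Fraction(1, 3) for i in range(3)}
q_reject = {}
for i in range(3):
    q_reject[(i, U_sets[i])] = 0        # never ignore a reading U_i
    q_reject[(i, W - U_sets[i])] = 1    # always ignore the complement A_i
P_W = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}

dist = cargen_star_distribution(P_W, partitions, P_P, q_reject)
```

In this instance every world has the same single-round rejection probability, so constraint (4) holds, and the likelihood of each observation is the same for every world it contains, which is the CAR condition.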

So why is CARgen* reasonable? Even though we have not given a formal definition of “reasonable” (although it can be done in the runs framework: essentially, each step of the algorithm can depend only on information available to the experimenter, where the “information” is encoded in the observations made by the experimenter in the course of running the algorithm), it is not hard to see that CARgen* satisfies our intuitive desiderata. The key point is that all the relevant steps in the algorithm can be carried out by an experimenter. The parameters $q$ and $q_{U|\Pi}$ for $\Pi \in \mathcal{P}$ and $U \in \Pi$ are chosen before the algorithm begins; this can certainly be done by an experimenter. Similarly, it is straightforward to check that equation (4) holds for each $w \in W$. As for the algorithm itself, the experimenter has no control over the choice of $w$; this is chosen by nature according to its distribution, $P_W$. However, the experimenter can perform steps 2.2 and 2.3, that is, choosing $\Pi \in \mathcal{P}$ according to the probability distribution $P_{\mathcal{P}}$, and rejecting the observation $U$ with probability $q_{U|\Pi}$, since the experimenter knows both the sensor chosen (i.e., $\Pi$) and the observation ($U$).

The  following  theorem  shows  that  CARgen*  does  exactly  what  we  want. 

Theorem 4.9: Given a set $\mathcal{R}$ of runs over a set $W$ of worlds and a set $O$ of observations, $\Pr$ is a distribution on $\mathcal{R}$ that satisfies CAR if and only if there is a setting of the parameters in (step 1 of) CARgen* such that, for all $w \in W$ and $U \in O$, $\Pr(\{r : X_W(r) = w, X_O(r) = U\})$ is the probability that CARgen* returns $(w,U)$.

5.  Beyond  Observations  of  Events 

In the previous section, we assumed that the information received was of the form “the actual world is in $U$”. Information must be in this form to apply conditioning. But in
general,  information  does  not  always  come  in  such  nice  packages.  In  this  section  we  study 
more  general  types  of  observations,  leading  to  generalizations  of  conditioning. 

5.1  Jeffrey  Conditioning 

Perhaps the simplest possible generalization is to assume that there is a partition $\{U_1, \ldots, U_n\}$ of $W$ and the agent observes $\alpha_1 U_1; \ldots; \alpha_n U_n$, where $\alpha_1 + \cdots + \alpha_n = 1$. This is to be interpreted as an observation that leads the agent to believe $U_j$ with probability $\alpha_j$, for $j = 1, \ldots, n$. According to Jeffrey conditioning, given a distribution $P_W$ on $W$,

$$P_W(V \mid \alpha_1 U_1; \ldots; \alpha_n U_n) = \alpha_1 P_W(V \mid U_1) + \cdots + \alpha_n P_W(V \mid U_n).$$

Jeffrey conditioning is defined only if $\alpha_i > 0$ implies that $P_W(U_i) > 0$; if $\alpha_i = 0$ and $P_W(U_i) = 0$, then $\alpha_i P_W(V \mid U_i)$ is taken to be 0. Clearly ordinary conditioning is the special case of Jeffrey conditioning where $\alpha_i = 1$ for some $i$ so, as is standard, we deliberately use the same notation for updating using Jeffrey conditioning and ordinary conditioning.
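
The definition can be made concrete with a small sketch. The function name `jeffrey_update` and the three-world prior are assumptions introduced for illustration; exact rational arithmetic is used so that results can be compared exactly.

```python
from fractions import Fraction

def jeffrey_update(prior, observation):
    """Jeffrey conditioning of `prior` on alpha_1 U_1; ...; alpha_n U_n.

    prior:       dict world -> probability
    observation: list of (alpha_i, U_i) pairs; the U_i are assumed to
                 partition the worlds and the alpha_i to sum to 1.
    """
    posterior = {w: Fraction(0) for w in prior}
    for alpha, U in observation:
        if alpha == 0:
            continue  # the term alpha_i * P_W(. | U_i) is taken to be 0
        mass = sum(prior[w] for w in U)
        if mass == 0:
            raise ValueError("undefined: alpha_i > 0 but P_W(U_i) = 0")
        for w in U:
            posterior[w] += alpha * prior[w] / mass
    return posterior

# Hypothetical prior and observations.
prior = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
soft = jeffrey_update(prior, [(Fraction(1, 3), {"a"}), (Fraction(2, 3), {"b", "c"})])
hard = jeffrey_update(prior, [(0, {"a"}), (1, {"b", "c"})])  # ordinary conditioning
```

Here `hard` is the special case with $\alpha_i = 1$ for one cell, which coincides with ordinary conditioning on that cell.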

We now want to determine when updating in the naive space using Jeffrey conditioning is appropriate. Thus, we assume that the agent’s observations now have the form $\alpha_1 U_1; \ldots; \alpha_n U_n$ for some partition $\{U_1, \ldots, U_n\}$ of $W$. (Different observations may, in general, use different partitions.) Just as we did for the case that observations are events (Section 3, first paragraph), we once again assume that the agent’s observations are accurate. What does that mean in the present context? We simply require that, conditional on making the observation, the probability of $U_i$ really is $\alpha_i$ for $i = 1, \ldots, n$. That is, for $i = 1, \ldots, n$, we have

$$\Pr(X_W \in U_i \mid X_O = \alpha_1 U_1; \ldots; \alpha_n U_n) = \alpha_i. \qquad (5)$$

This  clearly  generalizes  the  requirement  of  accuracy  given  in  the  case  that  the  observations 
are  events. 

Not surprisingly, there is a generalization of the CAR condition that is needed to guarantee that Jeffrey conditioning can be applied to the naive space.

Theorem 5.1: Fix a probability $\Pr$ on $\mathcal{R}$, a partition $\{U_1, \ldots, U_n\}$ of $W$, and probabilities $\alpha_1, \ldots, \alpha_n$ such that $\alpha_1 + \cdots + \alpha_n = 1$. Let $C$ be the observation $\alpha_1 U_1; \ldots; \alpha_n U_n$. Fix some $i \in \{1, \ldots, n\}$. Then the following are equivalent:

(a) If $\Pr(X_O = C) > 0$, then $\Pr(X_W = w \mid X_O = C) = \Pr_W(w \mid \alpha_1 U_1; \ldots; \alpha_n U_n)$ for all $w \in U_i$.

(b) $\Pr(X_O = C \mid X_W = w) = \Pr(X_O = C \mid X_W \in U_i)$ for all $w \in U_i$ such that $\Pr(X_W = w) > 0$.

Part  (b)  of  Theorem  5.1  is  analogous  to  part  (c)  of  Theorem  3.1.  There  are  a  number 
of  conditions  equivalent  to  (b)  that  we  could  have  stated,  similar  in  spirit  to  the  conditions 
in  Theorem  3.1.  Note  that  these  are  even  more  stringent  conditions  than  are  required  for 
ordinary  conditioning  to  be  appropriate. 

Examples 3.3 and 4.6 already suggest that there are not too many nontrivial scenarios where applying Jeffrey conditioning to the naive space is appropriate. However, just as for the original CAR condition, there do exist special situations in which generalized CAR is a realistic assumption. For ordinary CAR, we mentioned the CARgen mechanism (Section 4.5). For Jeffrey conditioning, a similar mechanism may be a realistic model in some situations where all observations refer to the same partition $\{U_1, \ldots, U_n\}$ of $W$. We now describe a scenario for such a situation. Suppose $O$ consists of $k > 1$ observations $C_1, \ldots, C_k$ with $C_i = \alpha_{i1} U_1; \ldots; \alpha_{in} U_n$ such that all $\alpha_{ij} > 0$. Now, fix $n$ (arbitrary) conditional distributions $\Pr_j$, $j = 1, \ldots, n$, on $W$. Intuitively, $\Pr_j$ is $\Pr_W(\cdot \mid U_j)$. Consider the following mechanism: first an observation $C_i$ is chosen (according to some distribution $P_O$ on $O$); then a set $U_j$ is chosen with probability $\alpha_{ij}$ (i.e., according to the distribution induced by $C_i$); finally, a world $w \in U_j$ is chosen according to $\Pr_j$.

If the observation $C_i$ and world $w$ are generated this way, then the generalized CAR condition holds, that is, conditioning in the sophisticated space coincides with Jeffrey conditioning:

Proposition 5.2: Consider a partition $\{U_1, \ldots, U_n\}$ of $W$ and a set of $k > 1$ observations $O$ as above. For every distribution $P_O$ on $O$ with $P_O(C_i) > 0$ for all $i \in \{1, \ldots, k\}$, there exists a distribution $\Pr$ on $\mathcal{R}$ such that $P_O = \Pr_O$ (i.e., $P_O$ is the marginal of $\Pr$ on $O$) and $\Pr$ satisfies the generalized CAR condition (Theorem 5.1(b)) for $U_1, \ldots, U_n$.
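
The three-stage mechanism described before the proposition can be sketched as follows. The code builds the joint distribution it induces and exposes the likelihood $\Pr(X_O = C_i \mid X_W = w)$, which under the mechanism should not depend on which $w$ within a cell $U_j$ is considered. All names and the numerical instance (two cells, two observations) are hypothetical choices made for illustration.

```python
from fractions import Fraction

def mechanism_joint(P_O, alphas, cond):
    """Joint Pr(X_O = C_i, X_W = w) induced by the mechanism.

    P_O:    dict i -> probability of choosing observation C_i
    alphas: dict i -> list [alpha_i1, ..., alpha_in]
    cond:   list of dicts; cond[j] is the distribution Pr_j on cell U_j
    """
    joint = {}
    for i, p_i in P_O.items():
        for j, a_ij in enumerate(alphas[i]):
            for w, p in cond[j].items():
                joint[(i, w)] = joint.get((i, w), 0) + p_i * a_ij * p
    return joint

def likelihood(joint, i, w):
    """Pr(X_O = C_i | X_W = w)."""
    p_w = sum(p for (_, v), p in joint.items() if v == w)
    return joint[(i, w)] / p_w

# Hypothetical instance: U_1 = {a, b}, U_2 = {c}.
cond = [{"a": Fraction(1, 4), "b": Fraction(3, 4)}, {"c": Fraction(1)}]
alphas = {0: [Fraction(1, 2), Fraction(1, 2)], 1: [Fraction(1, 5), Fraction(4, 5)]}
P_O = {0: Fraction(1, 3), 1: Fraction(2, 3)}
joint = mechanism_joint(P_O, alphas, cond)

# Accuracy check in the sense of (5): Pr(X_W in U_1 | X_O = C_0).
acc = (joint[(0, "a")] + joint[(0, "b")]) / P_O[0]
```

In this instance the likelihood of $C_0$ is the same for worlds `a` and `b`, which share the cell $U_1$, as condition (b) of Theorem 5.1 requires.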

Proposition 5.2 demonstrates that, even though the analogue of the CAR condition expressed in Theorem 5.1 is hard to satisfy in general, at least if the set $\{U_1, \ldots, U_n\}$ is the same for all observations, then for every such set of observations there exist some priors $\Pr$ on $\mathcal{R}$ for which the CAR-analogue is satisfied for all observations. As we show next, for MRE updating, this is no longer the case.

5.2 Minimum Relative Entropy Updating


What about cases where the constraints are not in the special form where Jeffrey’s conditioning can be applied? Perhaps the most common approach in this case is to use MRE. Given a constraint (where a constraint is simply a set of probability distributions; intuitively, the distributions satisfying the constraint) and a prior distribution $P_W$ on $W$, the idea is to pick, among all distributions satisfying the constraint, the one that is “closest” to the prior distribution, where the “closeness” of $P'_W$ to $P_W$ is measured using relative entropy. The relative entropy between $P'_W$ and $P_W$ (Kullback & Leibler, 1951; Cover & Thomas, 1991) is defined as

$$\sum_{w \in W} P'_W(w) \log \frac{P'_W(w)}{P_W(w)}.$$

(The logarithm here is taken to the base 2; if $P'_W(w) = 0$ then $P'_W(w) \log(P'_W(w)/P_W(w))$ is taken to be 0. This is reasonable since $\lim_{x \to 0} x \log(x/c) = 0$ if $c > 0$.) The relative entropy is finite provided that $P'_W$ is absolutely continuous with respect to $P_W$, in that if $P_W(w) = 0$, then $P'_W(w) = 0$, for all $w \in W$. Otherwise, it is defined to be infinite.
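
A minimal sketch of this definition, including the two conventions just stated (a zero term when $P'_W(w) = 0$, and an infinite value when absolute continuity fails), is given below. The function name and the two-world distributions are assumptions made for illustration.

```python
import math

def relative_entropy(p_new, p_prior):
    """Relative entropy (base 2) between p_new and p_prior.

    Both arguments are dicts mapping worlds to probabilities.
    """
    total = 0.0
    for w, p in p_new.items():
        if p == 0:
            continue              # 0 * log(0 / c) is taken to be 0
        q = p_prior.get(w, 0)
        if q == 0:
            return math.inf       # p_new is not absolutely continuous w.r.t. p_prior
        total += p * math.log2(p / q)
    return total

# Hypothetical two-world distributions.
uniform = {"a": 0.5, "b": 0.5}
point_a = {"a": 1.0, "b": 0.0}
d_point_uniform = relative_entropy(point_a, uniform)   # one bit
d_uniform_point = relative_entropy(uniform, point_a)   # infinite
```

The asymmetry between the two results reflects the fact that relative entropy is not a symmetric distance.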

The constraints we consider here are all closed and convex sets of probability measures. In this case, it is known that there is a unique distribution that satisfies the constraints and minimizes the relative entropy. Given a nonempty constraint $C$ and a probability distribution $P_W$ on $W$, let $P_W(\cdot \mid C)$ denote the distribution that minimizes relative entropy with respect to $P_W$.

If the constraints have the form to which Jeffrey’s Rule is applicable, that is, if they have the form $\{P'_W : P'_W(U_i) = \alpha_i, i = 1, \ldots, n\}$ for some partition $\{U_1, \ldots, U_n\}$, then it is well known that the distribution that minimizes entropy relative to a prior $P_W$ is $P_W(\cdot \mid \alpha_1 U_1; \ldots; \alpha_n U_n)$ (see, e.g., (Diaconis & Zabell, 1986)). Thus, MRE updating generalizes Jeffrey conditioning (and hence also standard conditioning).

To study MRE updating in our framework, we assume that the observations are now arbitrary closed convex constraints on the probability measure. Again, we assume that the observations are accurate in that, conditional on making the observation, the constraints hold. For now, we focus on the simplest possible case that cannot be handled by Jeffrey updating. In this case, constraints (observations) still have the form $\alpha_1 U_1; \ldots; \alpha_n U_n$, but now the $U_i$’s do not have to form a partition (they may overlap and/or not cover $W$) and the $\alpha_i$ do not have to sum to 1. Such an observation is accurate if it satisfies (5), just as before.

We  can  now  ask  the  same  questions  that  we  asked  before  about  ordinary  conditioning 
and  Jeffrey  conditioning  in  the  naive  space. 


1.  Is  there  an  alternative  characterization  of  the  conditions  under  which  MRE  updating 
coincides  with  conditioning  in  the  sophisticated  space?  That  is,  are  there  analogues 
of  Theorem  3.1  and  Theorem  5.1  for  MRE  updating? 

2.  Are  there  combinations  of  O  and  W  for  which  it  is  not  even  possible  that  MRE  can 
coincide  with  conditioning  in  the  sophisticated  space? 


With  regard  to  question  1,  it  is  easy  to  provide  a  counterexample  showing  that  there  is  no 
obvious  analogue  to  Theorem  5.1  for  MRE.  There  is  a  constraint  C  such  that  the  condition  of 
part (a) of Theorem 5.1 holds for MRE updating whereas part (b) does not hold. (We omit
the  details  here.)  Of  course,  it  is  possible  that  there  are  some  quite  different  conditions  that 
characterize  when  MRE  updating  coincides  with  conditioning  in  the  sophisticated  space. 
However,  even  if  they  exist,  such  conditions  may  be  uninteresting  in  that  they  may  hardly 
ever  apply.  Indeed,  as  a  partial  answer  to  question  2,  we  now  introduce  a  very  simple 
setting  in  which  MRE  updating  necessarily  leads  to  a  result  different  from  conditioning  in 
the  sophisticated  space. 

Let $U_1$ and $U_2$ be two subsets of $W$ such that $V_1 = U_1 - U_2$, $V_2 = U_2 - U_1$, $V_3 = U_1 \cap U_2$, and $V_4 = W - (U_1 \cup U_2)$ are all nonempty. Consider a constraint of the form $C = \alpha_1 U_1; \alpha_2 U_2$, where $\alpha_1, \alpha_2$ are both in $(0,1)$. We investigate what happens if we use MRE updating on $C$. Since $U_1$ and $U_2$ overlap and do not cover the space, in general Jeffrey conditioning cannot be applied to update on $C$. There are some situations where, despite the overlap, Jeffrey conditioning can essentially be applied. We say that observation $C = \alpha_1 U_1; \alpha_2 U_2$ is Jeffrey-like iff, after MRE updating on one of the constraints $\alpha_1 U_1$ or $\alpha_2 U_2$, the other constraint holds as well. That is, $C$ is Jeffrey-like (with respect to $P_W$) if either $P_W(U_2 \mid \alpha_1 U_1) = \alpha_2$ or $P_W(U_1 \mid \alpha_2 U_2) = \alpha_1$. Suppose that $P_W(U_2 \mid \alpha_1 U_1) = \alpha_2$; then it is easy to show that

$$P_W(\cdot \mid \alpha_1 U_1) = P_W(\cdot \mid \alpha_1 U_1; \alpha_2 U_2).$$

Intuitively, if the “closest” distribution $P'_W$ to $P_W$ that satisfies $P'_W(U_1) = \alpha_1$ also satisfies $P'_W(U_2) = \alpha_2$, then $P'_W$ is the closest distribution to $P_W$ that satisfies the constraint $C = \alpha_1 U_1; \alpha_2 U_2$. Note that MRE updating on $\alpha U$ is equivalent to Jeffrey conditioning on $\alpha U; (1 - \alpha)(W - U)$. Thus, if $C$ is Jeffrey-like, then updating with $C$ is equivalent to Jeffrey updating.
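
The Jeffrey-like test can be sketched directly from the definition, using the stated equivalence between MRE updating on $\alpha U$ and Jeffrey conditioning on $\alpha U; (1-\alpha)(W-U)$. The function names and the four-world uniform prior are assumptions chosen for illustration; the prior is assumed to give positive mass to both $U$ and its complement.

```python
from fractions import Fraction

def update_on_single_set(prior, U, alpha):
    """MRE update on the constraint alpha U, computed as Jeffrey
    conditioning on alpha U; (1 - alpha)(W - U)."""
    mass_in = sum(p for w, p in prior.items() if w in U)
    mass_out = 1 - mass_in
    return {
        w: (alpha * p / mass_in if w in U else (1 - alpha) * p / mass_out)
        for w, p in prior.items()
    }

def is_jeffrey_like(prior, U1, a1, U2, a2):
    """True iff updating on one constraint makes the other hold as well."""
    post1 = update_on_single_set(prior, U1, a1)
    post2 = update_on_single_set(prior, U2, a2)
    return (sum(post1[w] for w in U2) == a2
            or sum(post2[w] for w in U1) == a1)

# Hypothetical instance: four worlds, so V_1, ..., V_4 are all nonempty.
prior = {w: Fraction(1, 4) for w in (1, 2, 3, 4)}
U1, U2 = {1, 3}, {2, 3}
like = is_jeffrey_like(prior, U1, Fraction(1, 3), U2, Fraction(1, 2))
unlike = is_jeffrey_like(prior, U1, Fraction(1, 3), U2, Fraction(1, 4))
```

With this uniform prior, updating on $\alpha_1 U_1$ always leaves $U_2$ with probability $1/2$, so an observation is Jeffrey-like only when $\alpha_1 = 1/2$ or $\alpha_2 = 1/2$.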


Theorem 5.3: Given a set $\mathcal{R}$ of runs and a set $O = \{C_1, C_2\}$ of observations, where $C_i = \alpha_{i1} U_1; \alpha_{i2} U_2$, for $i = 1,2$, let $\Pr$ be a distribution on $\mathcal{R}$ such that $\Pr(X_O = C_1), \Pr(X_O = C_2) > 0$, and $\Pr_W(w) = \Pr(X_W = w) > 0$ for all $w \in W$. Let $\Pr^i = \Pr(\cdot \mid X_O = C_i)$, and let $\Pr^i_W$ be the marginal of $\Pr^i$ on $W$. If either $C_1$ or $C_2$ is not Jeffrey-like, then we cannot have $\Pr^i_W = \Pr_W(\cdot \mid C_i)$, for both $i = 1,2$.


For fixed $U_1$ and $U_2$, we can identify an observation $\alpha_1 U_1; \alpha_2 U_2$ with the pair $(\alpha_1, \alpha_2) \in (0,1)^2$. Under our conditions on $U_1$ and $U_2$, the set of all Jeffrey-like observations is a subset of (Lebesgue) measure 0 of this set. Thus, the set of observations for which MRE conditioning corresponds to conditioning in the sophisticated space is a (Lebesgue) measure 0 set in the space of possible observations. Note, however, that this set depends on the prior $P_W$ over $W$.

A result similar to Theorem 5.3 was proved by Seidenfeld (1986) (and considerably generalized by Dawid (2001)). Seidenfeld shows that, under very weak conditions, MRE updating cannot coincide with sophisticated conditioning if the observations have the form “the conditional probability of $U$ given $V$ is $\alpha$” (as is the case in the Judy Benjamin problem). Theorem 5.3 shows that this is impossible even for observations of the much simpler form $\alpha_1 U_1; \alpha_2 U_2$, unless we can reduce the problem to Jeffrey conditioning (in which case Theorem 5.1 applies).

[Figure 1 is a table whose body did not survive extraction. The surviving fragments indicate that, for MRE updating on observations giving the probabilities of two sets, the updates coincide only if both observations are Jeffrey-like (Theorem 5.3).]

Figure  1:  Conditions  under  which  updating  in  the  naive  space  coincides  with  conditioning 
in  the  sophisticated  space. 


6.  Discussion 

We  have  studied  the  circumstances  under  which  ordinary  conditioning,  Jeffrey  conditioning, 
and  MRE  updating  in  a  naive  space  can  be  justified,  where  “justified”  for  us  means  “agrees 
with  conditioning  in  the  sophisticated  space”.  The  main  message  of  this  paper  is  that, 
except  for  quite  special  cases,  the  three  methods  cannot  be  justified.  Figure  1  summarizes 
the  main  insights  of  this  paper  in  more  detail. 

As  we  mentioned  in  the  introduction,  the  idea  of  comparing  an  update  rule  in  a  “naive 
space”  with  conditioning  in  a  “sophisticated  space”  is  not  new;  it  appears  in  the  CAR 
literature  and  the  MRE  literature  (as  well  as  in  papers  such  as  (Halpern  &  Tuttle,  1993) 
and  (Dawid  &  Dickey,  1977)).  In  addition  to  bringing  these  two  strands  of  research  together, 
our  own  contributions  are  the  following:  (a)  we  show  that  the  CAR  framework  can  be  used 
as  a  general  tool  to  clarify  many  of  the  well-known  paradoxes  of  conditional  probability; 
(b)  we  give  a  general  characterization  of  CAR  in  terms  of  a  binary-valued  matrix,  showing 
that  in  many  realistic  scenarios,  the  CAR  condition  cannot  hold  (Theorem  4.4);  (c)  we 
define  a  mechanism  CARgen*  that  generates  all  and  only  distributions  satisfying  CAR 
(Theorem  4.9);  (d)  we  show  that  the  CAR  condition  has  a  natural  extension  to  cases  where 
Jeffrey  conditioning  can  be  applied  (Theorem  5.1);  and  (e)  we  show  that  no  CAR-like 
condition  can  hold  in  general  for  cases  where  only  MRE  (and  not  Jeffrey)  updating  can  be 
applied  (Theorem  5.3). 

Our  results  suggest  that  working  in  the  naive  space  is  rather  problematic.  On  the  other 
hand,  as  we  observed  in  the  introduction,  working  in  the  sophisticated  space  (even  assuming 
it  can  be  constructed)  is  problematic  too.  So  what  are  the  alternatives? 

For one thing, it is worth observing that MRE updating is not always so bad. In many successful practical applications, the “constraint” on which to update is of the form $\frac{1}{n} \sum_{i=1}^{n} X_i = t$ for some large $n$, where $X_i$ is the $i$th outcome of a random variable $X$ on $W$. That is, we observe an empirical average of outcomes of $X$. In such a case, the MRE distribution is “close” (in the appropriate distance measure) to the distribution we arrive at by sophisticated conditioning. That is, if $\Pr'_W = \Pr_W(\cdot \mid E(X) = t)$, $\Pr'' = \Pr(\cdot \mid X_O = \langle \frac{1}{n} \sum_{i=1}^{n} X_i = t \rangle)$, and $Q^n$ denotes the $n$-fold product of a probability distribution $Q$, then for sufficiently large $n$, we have that $(\Pr'')^n \approx (\Pr'_W)^n$ (van Campenhout & Cover, 1981; Grünwald, 2001; Skyrms, 1985; Uffink, 1996). Thus, in such cases MRE (almost) coincides with sophisticated conditioning after all. (See (Dawid, 2001) for a discussion of how this result can be reconciled with the results of Section 5.)

But when this special situation does not apply, it is worth asking whether there exists an approach for updating in the naive space that can be easily applied in practical situations, yet leads to updated distributions that are better, in some formally provable sense, than those produced by the methods we have considered. A very interesting candidate, often informally applied by human agents,
is  to  simply  ignore  the  available  extra  information.  It  turns  out  that  there  are  situations 
where  this  update  rule  behaves  better,  in  a  precise  sense,  than  the  three  methods  we  have 
considered.  This  will  be  explored  in  future  work. 

Another  issue  that  needs  further  exploration  is  the  subtle  distinction  between  “weak” 
and  “strong”  CAR,  which  was  brought  to  our  attention  by  Manfred  Jaeger.  The  terminology 
is  due  to  James  Robins,  who  only  very  recently  discovered  the  distinction.  It  turns  out 
that the notion of CAR we use in this paper (“weak” CAR) is slightly different from the “strong” version of CAR used by Gill et al. (1997). They view the CAR assumption quite literally as an assumption about a “coarsening process”. Adjusted to our notation, they write (Gill et al., 1997, page 260): “Firstly the random variable $X_W$ of interest is realized; secondly, a conceptually different process (usually associated with features of measurement or observational restrictions, rather than the scientific phenomenon under study itself), given the value $w$ taken by $X_W$, replaces this value by a set $U$ such that $w \in U$.” Thus, in their view, the distribution on $\mathcal{R}$ is constructed from a distribution on $W$ and a (conceptually unrelated) set of conditional distributions $\Pr(X_O = \cdot \mid X_W = w)$, one for each $w \in W$. This implies that $\Pr(X_O = \cdot \mid X_W = w)$ is a well-defined number even if $\Pr(X_W = w) = 0$. Gill, van der Laan, and Robins (1997) then define the CAR condition as “for all $U \in O$, $\Pr(X_O = U \mid X_W = w)$ is constant in $w \in U$”. This is just part (d) of Theorem 3.1, which now has to hold even if $\Pr(X_W = w) = 0$. Thus, the set of distributions satisfying
strong  CAR  is  a  subset  of  the  set  satisfying  weak  CAR.  Robins  and  Jaeger  show  that  the 
inclusion  can  be  strict  and  that  this  can  have  substantial  consequences.  Therefore,  our 
characterizations  of  CAR  in  Section  4  apply  only  to  weak  CAR,  and  further  research  is 
needed  to  see  the  extent  to  which  they  also  apply  to  strong  CAR. 

Our  discussion  here  has  focused  completely  on  the  probabilistic  case.  However,  these 
questions also make sense for other representations of uncertainty. Interestingly, Friedman and Halpern (1999) show that AGM-style belief revision (Alchourrón, Gärdenfors, & Makinson, 1985) can be represented in terms of conditioning using a qualitative representation of uncertainty called a plausibility measure; to do this, the plausibility measure must
satisfy  the  analogue  of  Theorem  3.1(a),  so  that  observations  carry  no  more  information 
than  the  fact  that  they  are  true.  No  CAR-like  condition  is  given  to  guarantee  that  this 
condition  holds  for  plausibility  measures  though.  It  would  be  interesting  to  know  if  there 
are  analogues  to  CAR  for  other  representations  of  uncertainty,  such  as  possibility  measures 
(Dubois  &  Prade,  1990)  or  belief  functions  (Shafer,  1976). 

Appendix A. Proofs

In  this  section,  we  provide  the  proofs  of  all  the  results  in  the  paper.  For  convenience,  we 
restate  the  results  here. 

Theorem 3.1: Fix a probability $\Pr$ on $\mathcal{R}$ and a set $U \subseteq W$. The following are equivalent:

(a) If $\Pr(X_O = U) > 0$, then $\Pr(X_W = w \mid X_O = U) = \Pr(X_W = w \mid X_W \in U)$ for all $w \in U$.

(b) The event $X_W = w$ is independent of the event $X_O = U$ given $X_W \in U$, for all $w \in U$.

(c) $\Pr(X_O = U \mid X_W = w) = \Pr(X_O = U \mid X_W \in U)$ for all $w \in U$ such that $\Pr(X_W = w) > 0$.

(d) $\Pr(X_O = U \mid X_W = w) = \Pr(X_O = U \mid X_W = w')$ for all $w, w' \in U$ such that $\Pr(X_W = w) > 0$ and $\Pr(X_W = w') > 0$.
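
As a numerical illustration of the equivalence between (a) and (d), the sketch below checks both conditions on a finite joint distribution over pairs $(w, U)$. The function names are hypothetical, and both instances are assumptions made for illustration: a Monty Hall joint in which the player picks door 1 and the host chooses uniformly when he has a choice, and a three-world joint in which every world is equally likely to yield either observation containing it. The joint is assumed to put mass only on pairs with $w \in U$.

```python
from fractions import Fraction

def marginals(joint):
    """Marginals of a joint over (world, observation) pairs."""
    p_w, p_o = {}, {}
    for (w, U), p in joint.items():
        p_w[w] = p_w.get(w, 0) + p
        p_o[U] = p_o.get(U, 0) + p
    return p_w, p_o

def car_by_a(joint):
    """Condition (a): conditioning on X_O = U agrees with naive conditioning on U."""
    p_w, p_o = marginals(joint)
    for U, p_u in p_o.items():
        if p_u == 0:
            continue
        mass_U = sum(p_w.get(w, 0) for w in U)
        for w in U:
            if joint.get((w, U), 0) / p_u != p_w.get(w, 0) / mass_U:
                return False
    return True

def car_by_d(joint):
    """Condition (d): Pr(X_O = U | X_W = w) is constant over w in U with Pr(w) > 0."""
    p_w, p_o = marginals(joint)
    for U in p_o:
        likelihoods = {joint.get((w, U), 0) / p_w[w]
                       for w in U if p_w.get(w, 0) > 0}
        if len(likelihoods) > 1:
            return False
    return True

sixth, third = Fraction(1, 6), Fraction(1, 3)
# Monty Hall: worlds are car positions; the observation is the set of doors left closed.
monty = {
    (1, frozenset({1, 3})): sixth, (1, frozenset({1, 2})): sixth,
    (3, frozenset({1, 3})): third, (2, frozenset({1, 2})): third,
}
# Three worlds, three overlapping observations, uniform likelihoods.
U_sets = [frozenset("ab"), frozenset("bc"), frozenset("ac")]
uniform = {(w, U): sixth for U in U_sets for w in U}
```

The two checks agree on both instances: the Monty Hall joint violates CAR, while the second joint satisfies it.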

Proof: Suppose (a) holds. We want to show that $X_W = w$ and $X_O = U$ are independent given $X_W \in U$, for all $w \in U$. Fix $w \in U$. If $\Pr(X_O = U) = 0$ then the events are trivially independent. So suppose that $\Pr(X_O = U) > 0$. Clearly

$$\Pr(X_W = w \mid X_O = U \cap X_W \in U) = \Pr(X_W = w \mid X_O = U)$$

(since observing $U$ implies that the true world is in $U$). By part (a),

$$\Pr(X_W = w \mid X_O = U) = \Pr(X_W = w \mid X_W \in U).$$

Thus,

$$\Pr(X_W = w \mid X_O = U \cap X_W \in U) = \Pr(X_W = w \mid X_W \in U),$$

showing that $X_W = w$ is independent of $X_O = U$, given $X_W \in U$.

Next suppose that (b) holds, and $w \in U$ is such that $\Pr(X_W = w) > 0$. From part (b) it is immediate that $\Pr(X_O = U \mid X_W = w \cap X_W \in U) = \Pr(X_O = U \mid X_W \in U)$. Moreover, since $w \in U$, clearly $\Pr(X_O = U \mid X_W = w \cap X_W \in U) = \Pr(X_O = U \mid X_W = w)$. Part (c) now follows.

Clearly (d) follows immediately from (c). Thus, it remains to show that (a) follows from (d). We do this by showing that (d) implies (c) and that (c) implies (a). So suppose that (d) holds. Suppose that $\Pr(X_O = U \mid X_W = w) = a$ for all $w \in U$ such that $\Pr(X_W = w) > 0$. From the definition of conditional probability

$$\begin{aligned}
\Pr(X_O = U \mid X_W \in U)
&= \sum_{\{w \in U:\, \Pr(X_W = w) > 0\}} \Pr(X_O = U \cap X_W = w) / \Pr(X_W \in U) \\
&= \sum_{\{w \in U:\, \Pr(X_W = w) > 0\}} \Pr(X_O = U \mid X_W = w) \Pr(X_W = w) / \Pr(X_W \in U) \\
&= \sum_{\{w \in U:\, \Pr(X_W = w) > 0\}} a \Pr(X_W = w) / \Pr(X_W \in U) \\
&= a.
\end{aligned}$$

Thus, (c) follows from (d).

Finally, to see that (a) follows from (c), suppose that (c) holds. If $w \in U$ is such that $\Pr(X_W = w) = 0$, then (a) is immediate, so suppose that $\Pr(X_W = w) > 0$. Then, using (c) and the fact that $X_O = U \subseteq X_W \in U$, we have that

$$\begin{aligned}
\Pr(X_W = w \mid X_O = U)
&= \Pr(X_O = U \mid X_W = w) \Pr(X_W = w) / \Pr(X_O = U) \\
&= \Pr(X_O = U \mid X_W \in U) \Pr(X_W = w) / \Pr(X_O = U) \\
&= \Pr(X_O = U \cap X_W \in U) \Pr(X_W = w) / \bigl(\Pr(X_W \in U) \Pr(X_O = U)\bigr) \\
&= \Pr(X_O = U) \Pr(X_W = w) / \bigl(\Pr(X_W \in U) \Pr(X_O = U)\bigr) \\
&= \Pr(X_W = w) / \Pr(X_W \in U) \\
&= \Pr(X_W = w \mid X_W \in U),
\end{aligned}$$

as desired. ∎

Proposition 4.1: The CAR condition holds for all distributions $\Pr$ on $\mathcal{R}$ if and only if $O$ consists of pairwise disjoint subsets of $W$.

Proof: First suppose that the sets in $O$ are pairwise disjoint. Then for each probability distribution $\Pr$ on $\mathcal{R}$, each $U \in O$, and each world $w \in U$ such that $\Pr(X_W = w) > 0$, it must be the case that $\Pr(X_O = U \mid X_W = w) = 1$. Thus, part (d) of Theorem 3.1 applies.

For the converse, suppose that the sets in $O$ are not pairwise disjoint. Then there exist sets $U, U' \in O$ such that both $U - U'$ and $U \cap U'$ are nonempty. Let $w_0 \in U \cap U'$. Clearly there exists a distribution $\Pr$ on $\mathcal{R}$ such that $\Pr(X_O = U) > 0$, $\Pr(X_O = U') > 0$, $\Pr(X_W = w_0 \mid X_O = U) = 0$, $\Pr(X_W = w_0 \mid X_O = U') > 0$. But then $\Pr(X_W = w_0 \mid X_W \in U) > 0$. Thus

$$\Pr(X_W = w_0 \mid X_O = U) \neq \Pr(X_W = w_0 \mid X_W \in U),$$

and the CAR condition (part (a) of Theorem 3.1) is violated. ∎

Lemma 4.3: Let $\mathcal{R}$ be the set of runs over observations $O$ and worlds $W$, and let $S$ be the CARacterizing matrix for $O$ and $W$.

(a) Let $\Pr$ be any distribution over $\mathcal{R}$ and let $S'$ be the matrix obtained by deleting from $S$ all rows corresponding to an atom $A$ with $\Pr(X_W \in A) = 0$. Define the vector $\vec{\gamma} = (\gamma_1, \ldots, \gamma_n)$ by setting $\gamma_j = \Pr(X_O = U_j \mid X_W \in U_j)$ if $\Pr(X_W \in U_j) > 0$, and $\gamma_j = 0$ otherwise, for $j = 1, \ldots, n$. If $\Pr$ satisfies CAR, then $S' \cdot \vec{\gamma}^T = \vec{1}^T$.

(b) Let $S'$ be a matrix consisting of a subset of the rows of $S$, and let $\mathcal{P}_{W,S'}$ be the set of distributions over $W$ with support corresponding to $S'$; i.e.,

$$\mathcal{P}_{W,S'} = \{P_W \mid P_W(A) > 0 \text{ iff } A \text{ corresponds to a row in } S'\}.$$

If there exists a vector $\vec{\gamma} \ge \vec{0}$ such that $S' \cdot \vec{\gamma}^T = \vec{1}^T$, then, for all $P_W \in \mathcal{P}_{W,S'}$, there exists a distribution $\Pr$ over $\mathcal{R}$ with $\Pr_W = P_W$ (i.e., the marginal of $\Pr$ on $W$ is $P_W$) such that (a) $\Pr$ satisfies CAR and (b) $\Pr(X_O = U_j \mid X_W \in U_j) = \gamma_j$ for all $j$ with $\Pr(X_W \in U_j) > 0$.
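
The linear condition in the lemma is easy to check mechanically. The sketch below builds a 0/1 matrix with one row per atom and one column per observation, with entry 1 exactly when the atom is contained in the observation (as used in the proof), and tests whether a candidate nonnegative vector solves the system. The function names, the explicit list of atoms, and the three-observation instance are assumptions made for illustration.

```python
from fractions import Fraction

def caracterizing_matrix(atoms, observations):
    """Row i, column j is 1 iff atom A_i is a subset of observation U_j."""
    return [[1 if A <= U else 0 for U in observations] for A in atoms]

def solves_system(S, gamma):
    """True iff gamma is nonnegative and every row of S dotted with gamma equals 1."""
    if any(g < 0 for g in gamma):
        return False
    return all(sum(s * g for s, g in zip(row, gamma)) == 1 for row in S)

# Hypothetical instance: atoms are the singletons of a three-world space.
atoms = [frozenset("a"), frozenset("b"), frozenset("c")]
observations = [frozenset("ab"), frozenset("bc"), frozenset("ac")]
S = caracterizing_matrix(atoms, observations)

half = Fraction(1, 2)
ok = solves_system(S, [half, half, half])
bad = solves_system(S, [1, 0, 0])
```

In this instance the vector with all entries $1/2$ solves the system, so by part (b) every prior with full support admits a CAR distribution with those likelihoods; the vector $(1, 0, 0)$ fails on the row for the atom not contained in the first observation.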

Proof: For part (a), suppose that $\Pr$ is a distribution on $\mathcal{R}$ that satisfies CAR. Let $k$ be the number of rows in $S'$, and let $\alpha_i = \Pr(X_W \in A_i)$, for $i = 1, \ldots, k$, where $A_i$ is the atom corresponding to the $i$th row of $S'$. Note that $\alpha_i > 0$ for $i = 1, \ldots, k$. Clearly,

$$\sum_{\{j:\, A_i \subseteq U_j\}} \Pr(X_O = U_j \mid X_W \in A_i) = 1. \qquad (6)$$

It easily follows from the CAR condition that

$$\Pr(X_O = U_j \mid X_W \in A_i) = \Pr(X_O = U_j \mid X_W \in U_j)$$

for all $A_i \subseteq U_j$, so (6) is equivalent to

$$\sum_{\{j:\, A_i \subseteq U_j\}} \Pr(X_O = U_j \mid X_W \in U_j) = 1. \qquad (7)$$

(7) implies that $\sum_{\{j:\, A_i \subseteq U_j\}} \gamma_j = 1$ for $i = 1, \ldots, k$. Let $s_i$ be the row in $S'$ corresponding to $A_i$. Since $s_i$ has a 1 as its $j$th component if $A_i \subseteq U_j$ and a 0 otherwise, it follows that $s_i \cdot \vec{\gamma}^T = 1$ and hence $S' \cdot \vec{\gamma}^T = \vec{1}^T$.

For part (b), let $k$ be the number of rows in $S'$, let $s_1, \ldots, s_k$ be the rows of $S'$, and let $A_1, \ldots, A_k$ be the corresponding atoms. Fix $P_W \in \mathcal{P}_{W,S'}$, and set $\alpha_i = P_W(A_i)$ for $i = 1, \ldots, k$. Let $\Pr$ be the unique distribution on $\mathcal{R}$ such that

$$
\begin{aligned}
\Pr(X_W \in A_i) &= \alpha_i, \quad \text{for } i = 1, \ldots, k,\\
\Pr(X_W \in A) &= 0 \quad \text{if } A \in \mathcal{A} - \{A_1, \ldots, A_k\},\\
\Pr(X_O = U_j \mid X_W \in A_i) &=
\begin{cases}
\gamma_j & \text{if } A_i \subseteq U_j,\\
0 & \text{otherwise.}
\end{cases}
\end{aligned}
\tag{8}
$$

Note that $\Pr$ is indeed a probability distribution on $\mathcal{R}$, since $\sum_{A \in \mathcal{A}} \Pr(X_W \in A) = 1$, $\Pr(X_W \in A_i) > 0$ for $i = 1, \ldots, k$, and, since we are assuming that $S' \cdot \vec{\gamma}^T = \vec{1}^T$,

$$\sum_{j=1}^{n} \Pr(X_O = U_j \mid X_W \in A_i) = s_i \cdot \vec{\gamma}^T = 1,$$

for $i = 1, \ldots, k$. Clearly $\Pr_W = P_W$. It remains to show that $\Pr$ satisfies CAR and that $\gamma_j = \Pr(X_O = U_j \mid X_W \in U_j)$. Given $j \in \{1, \ldots, n\}$, suppose that there exist atoms $A_i$, $A_{i'}$ corresponding to rows $s_i$ and $s_{i'}$ of $S'$ such that $A_i, A_{i'} \subseteq U_j$. Then

$$\Pr(X_O = U_j \mid X_W \in A_i) = \Pr(X_O = U_j \mid X_W \in A_{i'}) = \gamma_j.$$

It now follows by Theorem 3.1(c) that $\Pr$ satisfies the CAR condition for $U_1, \ldots, U_n$. Moreover, by Theorem 3.1(d), it must be the case that $\Pr(X_O = U_j \mid X_W \in U_j) = \gamma_j$. ∎
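
As a concrete check of the construction in (8), the following minimal Python sketch builds the joint distribution for a small hypothetical instance and verifies both claims of part (b). The instance (three worlds, each its own atom, and three two-element observations), the prior `p_w`, and the helper names `build_joint` and `satisfies_car` are illustrative assumptions and do not come from the text.

```python
from fractions import Fraction as F

def build_joint(p_w, observations, gamma):
    """Joint Pr(X_W = w, X_O = U_j) following construction (8):
    Pr(X_O = U_j | X_W = w) is gamma[j] if w is in U_j, and 0 otherwise."""
    return {(w, j): p * gamma[j]
            for w, p in p_w.items()
            for j, U in enumerate(observations) if w in U}

def satisfies_car(joint, p_w, observations):
    """CAR check: Pr(X_O = U_j | X_W = w) takes a single value over all
    worlds w in U_j that have positive probability."""
    for j, U in enumerate(observations):
        conds = {joint.get((w, j), F(0)) / p_w[w] for w in U if p_w[w] > 0}
        if len(conds) > 1:
            return False
    return True

# Hypothetical toy instance (not from the paper): each world is an atom.
observations = [{0, 1}, {1, 2}, {0, 2}]
p_w = {0: F(1, 2), 1: F(1, 3), 2: F(1, 6)}
# Each world lies in exactly two observations, so gamma = (1/2, 1/2, 1/2)
# solves S' . gamma^T = 1^T for this instance.
gamma = [F(1, 2)] * 3

joint = build_joint(p_w, observations, gamma)
```

In this instance the marginal of `joint` on the worlds is `p_w`, the CAR check succeeds, and the ratio $\Pr(X_O = U_j)/\Pr(X_W \in U_j)$ equals `gamma[j]` for each $j$, as the lemma asserts.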

The  proof  of  Theorem  4.4  builds  on  Lemma  4.3  and  the  following  proposition,  which 
shows  that  the  condition  of  part  (b)  of  Theorem  4.4  is  actually  stronger  than  the  condition 
of  part  (a).  It  is  therefore  not  surprising  that  it  leads  to  a  stronger  conclusion. 

Proposition A.1: If there exists a subset $R$ of rows of $S$ that is linearly dependent but not affinely dependent, then for all $\mathcal{R}$-atoms $A$ corresponding to a row in $R$ and all $j^* \in \{1, \ldots, n\}$, if $A \subseteq U_{j^*}$, there exists a vector $\vec{u}$ that is an affine combination of the rows in $R$ such that $u_j \ge 0$ for all $j \in \{1, \ldots, n\}$ and $u_{j^*} > 0$.

Proof: Suppose that there exists a subset $R$ of rows of $S$ that is linearly dependent but not affinely dependent. Without loss of generality, let $v_1, \ldots, v_k$ be the rows in $R$. There exist $\lambda_1, \ldots, \lambda_k$ such that $\kappa = \sum_{i=1}^{k} \lambda_i \ne 0$ and $\sum_{i=1}^{k} \lambda_i v_i = 0$. We first show that in fact every row $v$ in $R$ is an affine combination of the other rows. Fix some $j \in \{1, \ldots, k\}$. Let $\mu_j = \lambda_j - \sum_{i=1}^{k} \lambda_i = -\sum_{i \ne j} \lambda_i$ and let $\mu_i = \lambda_i$ for $i \ne j$. Then $\sum_{i=1}^{k} \mu_i = 0$ and

$$\sum_{i=1}^{k} \mu_i v_i = \sum_{i=1}^{k} \lambda_i v_i - \Big(\sum_{i=1}^{k} \lambda_i\Big) v_j = -\kappa v_j.$$

For $i = 1, \ldots, k$, let $\mu'_i = -\mu_i/\kappa$. Then $\sum_{i=1}^{k} \mu'_i = 0$ and $\sum_{i=1}^{k} \mu'_i v_i = v_j$. Now if $A_i \subseteq U_{j^*}$ for some $i = 1, \ldots, k$ and some $j^* = 1, \ldots, n$, then $v_i$ has a 1 as its $j^*$th component. Also, $v_i$ is an affine combination of the rows of $R$ with no negative components, so $v_i$ is the desired vector. ∎
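
The coefficient manipulation in this proof can be replayed mechanically. The sketch below assumes, as the proof does, that "affine combination" refers to coefficients summing to zero; the three rows and the dependency `lam` are a hypothetical example, and `affine_representation` and `combine` are illustrative helper names.

```python
from fractions import Fraction as F

def affine_representation(lam, j):
    """Given lam with sum_i lam[i] * v[i] = 0 and kappa = sum(lam) != 0,
    return mu' with sum(mu') = 0 and sum_i mu'[i] * v[i] = v[j]."""
    kappa = sum(lam)
    if kappa == 0:
        raise ValueError("the dependency must have nonzero coefficient sum")
    mu = list(lam)
    mu[j] = lam[j] - kappa          # mu_j = lambda_j - kappa; mu_i = lambda_i otherwise
    return [-m / kappa for m in mu]  # mu'_i = -mu_i / kappa

def combine(coeffs, rows):
    """Return sum_i coeffs[i] * rows[i], computed componentwise."""
    return tuple(sum(c * r[d] for c, r in zip(coeffs, rows))
                 for d in range(len(rows[0])))

# Hypothetical rows: v1 - v2 + v3 = 0, and the coefficients sum to 1.
rows = [(1, 0), (1, 1), (0, 1)]
lam = [F(1), F(-1), F(1)]

coeffs_for_first_row = affine_representation(lam, 0)
```

For the first row this gives coefficients $(0, 1, -1)$, which sum to zero and recombine to $(1, 0)$; the same holds for the other two rows.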

Theorem 4.4: Let $\mathcal{R}$ be a set of runs over observations $\mathcal{O} = \{U_1, \ldots, U_n\}$ and worlds $W$, and let $S$ be the CARacterizing matrix for $\mathcal{O}$ and $W$.

(a) Suppose that there exists a subset $R$ of the rows in $S$ and a vector $\vec{u} = (u_1, \ldots, u_n)$ that is an affine combination of the rows of $R$ such that $u_j \ge 0$ for all $j \in \{1, \ldots, n\}$ and $u_{j^*} > 0$ for some $j^* \in \{1, \ldots, n\}$. Then there is no distribution $\Pr$ on $\mathcal{R}$ that satisfies CAR such that $\Pr(X_O = U_{j^*}) > 0$ and $\Pr(X_W \in A) > 0$ for each $\mathcal{R}$-atom $A$ corresponding to a row in $R$.

(b) If there exists a subset $R$ of the rows of $S$ that is linearly dependent but not affinely dependent, then there is no distribution $\Pr$ on $\mathcal{R}$ that satisfies CAR such that $\Pr(X_W \in A) > 0$ for each $\mathcal{R}$-atom $A$ corresponding to a row in $R$.

(c) Given a set $R$ consisting of $n$ linearly independent rows of $S$ and a distribution $P_W$ on $W$ such that $P_W(A) > 0$ for all $A$ corresponding to a row in $R$, there is a unique distribution $P_O$ on $\mathcal{O}$ such that if $\Pr$ is a distribution on $\mathcal{R}$ satisfying CAR and $\Pr(X_W \in A) = P_W(A)$ for each atom $A$ corresponding to a row in $R$, then $\Pr(X_O = U) = P_O(U)$.

Proof: For part (a), suppose that $R$ consists of $v_1, \ldots, v_k$, corresponding to atoms $A_1, \ldots, A_k$. By assumption, there exist coefficients $\lambda_1, \ldots, \lambda_k$ such that $\sum_{i=1}^{k} \lambda_i = 0$, and a vector $\vec{u} = \sum_{i=1}^{k} \lambda_i v_i$ such that every component of $\vec{u}$ is nonnegative. Suppose, by way of contradiction, that $\Pr$ satisfies CAR and that $\alpha_i = \Pr(X_W \in A_i) > 0$ for $i \in \{1, \ldots, k\}$. By Lemma 4.3(a), we have

$$\Big(\sum_{i=1}^{k} \lambda_i v_i\Big) \cdot \vec{\gamma}^T = \sum_{i=1}^{k} \lambda_i \,(v_i \cdot \vec{\gamma}^T) = \sum_{i=1}^{k} \lambda_i = 0, \tag{9}$$

where $\vec{\gamma}$ is defined as in Lemma 4.3. For $j = 1, \ldots, n$, if $\Pr(X_O = U_j) > 0$ then $\Pr(X_O = U_j \cap X_W \in U_j) = \Pr(X_O = U_j) > 0$ and $\Pr(X_W \in U_j) > 0$, so $\gamma_j > 0$. By assumption, all the components of $\vec{u}$ and $\vec{\gamma}$ are nonnegative. Therefore, if there exists $j^*$ such that $\Pr(X_O = U_{j^*}) > 0$ and $u_{j^*} > 0$, then $\vec{u} \cdot \vec{\gamma}^T > 0$. This contradicts (9), and part (a) is proved.

For part (b), suppose that there exists a subset $R$ of rows of $S$ that is linearly dependent but not affinely dependent. Suppose, by way of contradiction, that $\Pr$ satisfies CAR and that $\Pr(X_W \in A) > 0$ for all atoms $A$ corresponding to a row in $R$. Pick an atom $A^*$ corresponding to such a row. By Proposition A.1 and Theorem 4.4(a), we have that $\Pr(X_O = U_{j^*}) = 0$ for all $j^*$ such that $A^* \subseteq U_{j^*}$. But then $\Pr(X_W \in A^*) = 0$, since whenever the world lies in $A^*$ the observation made must be some $U_{j^*}$ containing $A^*$, and we have arrived at a contradiction.

For part (c), suppose that $R$ consists of the rows $v_1, \ldots, v_n$. Let $S'$ be the $n \times n$ submatrix of $S$ consisting of the rows of $R$. Since these rows are linearly independent, a standard result of linear algebra says that $S'$ is invertible. Let $\Pr$ be a distribution on $\mathcal{R}$ satisfying CAR. By Lemma 4.3(a), $S' \cdot \vec{\gamma}^T = \vec{1}^T$. Thus, $\vec{\gamma}^T = (S')^{-1}\,\vec{1}^T$. For $j = 1, \ldots, n$ we must have $\gamma_j = \beta_j / \Pr(X_W \in U_j)$, where $\beta_j = \Pr(X_O = U_j)$. Given $\Pr_W(A)$ for each atom $A$, we can clearly solve for the $\beta_j$'s. ∎
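
The last step of part (c) amounts to a small linear solve. The following sketch, under the assumption of a hypothetical instance with three singleton atoms and three observations, solves $S' \cdot \vec{\gamma}^T = \vec{1}^T$ exactly and then recovers $\beta_j = \gamma_j \Pr(X_W \in U_j)$. The matrix, the prior `p_w`, and the helper `solve` are illustrative and not taken from the paper.

```python
from fractions import Fraction as F

def solve(matrix, rhs):
    """Solve a square, invertible linear system exactly by Gauss-Jordan
    elimination (raises StopIteration if no pivot exists, i.e. if singular)."""
    n = len(matrix)
    aug = [[F(x) for x in row] + [F(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

# Hypothetical instance: worlds 0, 1, 2 are the atoms.
observations = [{0, 1}, {1, 2}, {0, 2}]
p_w = {0: F(1, 2), 1: F(1, 3), 2: F(1, 6)}

# Row for world w has a 1 in column j exactly when w is in U_j.
s_prime = [[1 if w in U else 0 for U in observations] for w in sorted(p_w)]

gamma = solve(s_prime, [1, 1, 1])
p_in = [sum(p_w[w] for w in U) for U in observations]   # Pr(X_W in U_j)
beta = [g * p for g, p in zip(gamma, p_in)]             # Pr(X_O = U_j)
```

Here `gamma` comes out as $(1/2, 1/2, 1/2)$ and `beta` as $(5/12, 1/4, 1/3)$, which sums to one, so the observation distribution is pinned down by the prior on atoms as the theorem states.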

Theorem 4.9: Given a set $\mathcal{R}$ of runs over a set $W$ of worlds and a set $\mathcal{O}$ of observations, $\Pr$ is a distribution on $\mathcal{R}$ that satisfies CAR iff there is a setting of the parameters in CARgen* such that, for all $w \in W$ and $U \in \mathcal{O}$, $\Pr(\{r : X_W(r) = w, X_O(r) = U\})$ is the probability that CARgen* returns $(w, U)$.

Proof: First we show that if $\Pr$ is a probability on $\mathcal{R}$ such that, for some setting of the parameters of CARgen*, $\Pr(\{r : X_W(r) = w, X_O(r) = U\})$ is the probability that CARgen* returns $(w, U)$, then $\Pr$ satisfies CAR. By Theorem 3.1, it suffices to show that, for each set $U \in \mathcal{O}$ and worlds $w_1, w_2 \in U$ such that $\Pr(X_W = w_1) > 0$ and $\Pr(X_W = w_2) > 0$, we have $\Pr(X_O = U \mid X_W = w_1) = \Pr(X_O = U \mid X_W = w_2)$. So suppose that $w_1, w_2 \in U$, $\Pr(X_W = w_1) > 0$, and $\Pr(X_W = w_2) > 0$. Let $a_U = \sum_{\{\Pi \in \mathcal{P} : U \in \Pi\}} P_{\mathcal{P}}(\Pi)(1 - q_{U|\Pi})$. Intuitively, $a_U$ is the probability that the algorithm terminates immediately at step 2.3 with $(w, U)$ conditional on some $w \in U$ being chosen at step 2.1. Notice for future reference that, for all $w$,

$$\sum_{\{U : w \in U\}} a_U = \sum_{\{(U,\Pi) : U \in \Pi,\, w \in U\}} P_{\mathcal{P}}(\Pi)(1 - q_{U|\Pi}) = 1 - q, \tag{10}$$

where $q$ is defined by (4). As explained in the main text, for both $i = 1, 2$, $q$ is the probability that the algorithm does not terminate at step 2.3 given that $w_i$ is chosen in step 2.1. It easily follows that the probability that $(w_i, U)$ is output at step 2.3 is

$$P_W(w_i)\, a_U\, (1 + q + q^2 + \cdots) = P_W(w_i)\, a_U / (1 - q).$$

Thus, $\Pr(X_W = w_i \cap X_O = U) = P_W(w_i)\, a_U/(1 - q)$. Using (10), we have that

$$\Pr(X_W = w_i) = \sum_{\{U : w_i \in U\}} \Pr(X_W = w_i \cap X_O = U) = \sum_{\{U : w_i \in U\}} \frac{P_W(w_i)\, a_U}{1 - q} = P_W(w_i).$$

Finally, we have that $\Pr(X_O = U \mid X_W = w_i) = a_U/(1 - q)$, for $i = 1, 2$. Thus, $\Pr$ satisfies the CAR condition.

For the converse, suppose that $\Pr$ satisfies the CAR condition. Let $\mathcal{O} = \{U_1, \ldots, U_n\}$. We choose the parameters for CARgen* as follows. Set $P_W(w) = \Pr(X_W = w)$ and let $\beta_j = \Pr(X_O = U_j)$. Without loss of generality, we assume that $\beta_j > 0$ (otherwise, take $\mathcal{O}'$ to consist of those sets that are observed with positive probability, and do the proof using $\mathcal{O}'$).

For $i = 1, \ldots, n$, let $\Pi_i = \{U_i, \overline{U}_i\}$. Set $P_{\mathcal{P}}(\Pi_i) = \Pr(X_O = U_i) = \beta_i$ and $q_{\overline{U}_i|\Pi_i} = 1$. (Thus, the set $\overline{U}_i$ is always rejected, unless $\overline{U}_i = U_j$ for some $j$.) Since $\Pr(X_W \in U_i) \ge \Pr(X_O = U_i) > 0$ by assumption, it must be the case that $\epsilon = \min_{i=1}^{n} \Pr(X_W \in U_i) > 0$. Now set $q_{U_i|\Pi_i} = 1 - \epsilon/\Pr(X_W \in U_i)$.

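We first show that, with these parameter settings, we can choose $q$ such that constraint (4) is satisfied. Let $q_w = \sum_{\{(U,\Pi) :\, w \in U,\, U \in \Pi\}} P_{\mathcal{P}}(\Pi)\, q_{U|\Pi}$. For each $w \in W$ such that $P_W(w) > 0$, we have

$$
\begin{aligned}
q_w &= \sum_{\{(U,\Pi) :\, w \in U,\, U \in \Pi\}} P_{\mathcal{P}}(\Pi)\, q_{U|\Pi}\\
&= \sum_{i=1}^{n} \sum_{\{U :\, w \in U,\, U \in \Pi_i\}} P_{\mathcal{P}}(\Pi_i)\, q_{U|\Pi_i}\\
&= \sum_{\{i :\, w \in U_i\}} P_{\mathcal{P}}(\Pi_i)\, q_{U_i|\Pi_i} + \sum_{\{i :\, w \notin U_i\}} P_{\mathcal{P}}(\Pi_i)\, q_{\overline{U}_i|\Pi_i}.
\end{aligned}
$$

The last equality follows because $\Pi_i = \{U_i, \overline{U}_i\}$. Thus, for a fixed $i$, $\sum_{\{U :\, w \in U,\, U \in \Pi_i\}} P_{\mathcal{P}}(\Pi_i)\, q_{U|\Pi_i}$ is either $P_{\mathcal{P}}(\Pi_i)\, q_{U_i|\Pi_i}$ if $w \in U_i$, or $P_{\mathcal{P}}(\Pi_i)\, q_{\overline{U}_i|\Pi_i}$ if $w \in \overline{U}_i$. It follows that

$$
\begin{aligned}
q_w &= \sum_{\{i :\, w \in U_i\}} \beta_i \big(1 - \epsilon/\Pr(X_W \in U_i)\big) + \sum_{\{i :\, w \notin U_i\}} \beta_i \cdot 1\\
&= \sum_{\{i :\, w \in U_i\}} \Pr(X_O = U_i)\big(1 - \epsilon/\Pr(X_W \in U_i)\big) + \sum_{\{i :\, w \notin U_i\}} \Pr(X_O = U_i)\\
&= \sum_{i=1}^{n} \Pr(X_O = U_i) - \epsilon \sum_{\{i :\, w \in U_i\}} \Pr(X_O = U_i \mid X_W \in U_i)\\
&= 1 - \epsilon \sum_{\{i :\, w \in U_i\}} \Pr(X_O = U_i \mid X_W = w) \quad \text{[since $\Pr$ satisfies CAR]}\\
&= 1 - \epsilon.
\end{aligned}
$$

Thus, $q_w = q_{w'}$ if $P_W(w), P_W(w') > 0$, so these parameter settings are appropriate for CARgen* (taking $q = q_w$ for any $w$ such that $P_W(w) > 0$). Moreover, $\epsilon = 1 - q$.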

We now show that, with these parameter settings, $\Pr(X_W = w \cap X_O = U)$ is the probability that CARgen* halts with $(w, U)$, for all $w \in W$ and $U \in \mathcal{O}$. Clearly if $\Pr(X_W = w) = 0$, this is true, since then $\Pr(X_W = w \cap X_O = U) = 0$, and the probability that CARgen* halts with output $(w, U)$ is at most $P_W(w) = \Pr(X_W = w) = 0$. So suppose that $\Pr(X_W = w) > 0$. Then it suffices to show that $\Pr(X_O = U_i \mid X_W = w)$ is the probability that $(w, U_i)$ is output, given that $w$ is chosen at the first step. But the argument of the first half of the proof shows that this probability is just $a_{U_i}/(1 - q)$. But

$$
\begin{aligned}
\frac{a_{U_i}}{1 - q} &= \frac{a_{U_i}}{\epsilon} \quad \text{[since $\epsilon = 1 - q$]}\\
&= \frac{\sum_{\{\Pi \in \mathcal{P} :\, U_i \in \Pi\}} P_{\mathcal{P}}(\Pi)(1 - q_{U_i|\Pi})}{\epsilon}\\
&= \frac{\beta_i \big(\epsilon/\Pr(X_W \in U_i)\big)}{\epsilon}\\
&= \Pr(X_O = U_i)/\Pr(X_W \in U_i)\\
&= \Pr(X_O = U_i \mid X_W = w) \quad \text{[since $\Pr$ satisfies CAR]},
\end{aligned}
$$

as desired. ∎
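
The parameter choice in the converse direction can be checked numerically on a small case. The sketch below does not simulate CARgen* itself (the algorithm is defined in the main text); it only evaluates the closed-form quantities $q_w$ and $P_W(w)\,a_U/(1-q)$ that appear in this proof. The three-world CAR distribution `joint` is a hypothetical example, and in this example no complement $\overline{U}_i$ coincides with an observation, so each $U_i$ belongs only to $\Pi_i$.

```python
from fractions import Fraction as F

# Hypothetical CAR distribution: every world is sent to each of the two
# observations containing it with probability 1/2.
observations = [{0, 1}, {1, 2}, {0, 2}]
p_w = {0: F(1, 2), 1: F(1, 3), 2: F(1, 6)}
joint = {(w, j): p * F(1, 2)
         for w, p in p_w.items()
         for j, U in enumerate(observations) if w in U}

n = len(observations)
beta = [sum(v for (_, j2), v in joint.items() if j2 == j) for j in range(n)]
p_in = [sum(p_w[w] for w in U) for U in observations]   # Pr(X_W in U_i)
eps = min(p_in)
q_keep = [1 - eps / p for p in p_in]   # q_{U_i | Pi_i}; the complement has q = 1

def q_of(w):
    """q_w: probability of not terminating at step 2.3, given w was chosen."""
    return sum(b * (q if w in U else 1)
               for b, q, U in zip(beta, q_keep, observations))

def output_prob(w, j):
    """P_W(w) * a_{U_j} / (1 - q), the probability that (w, U_j) is returned."""
    if w not in observations[j]:
        return F(0)
    a_u = beta[j] * (1 - q_keep[j])    # only Pi_j contains U_j in this example
    return p_w[w] * a_u / (1 - q_of(w))
```

With these values $q_w = 1 - \epsilon = 1/2$ for every world, and `output_prob(w, j)` agrees with `joint` on every pair, matching the conclusion of the proof for this instance.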

Theorem 5.1: Fix a probability $\Pr$ on $\mathcal{R}$, a partition $\{U_1, \ldots, U_n\}$ of $W$, and probabilities $\alpha_1, \ldots, \alpha_n$ such that $\alpha_1 + \cdots + \alpha_n = 1$. Let $C$ be the observation $\alpha_1 U_1; \ldots; \alpha_n U_n$. Fix some $i \in \{1, \ldots, n\}$. Then the following are equivalent:

(a) If $\Pr(X_O = C) > 0$, then $\Pr(X_W = w \mid X_O = C) = \Pr_W(w \mid \alpha_1 U_1; \ldots; \alpha_n U_n)$ for all $w \in U_i$.

(b) $\Pr(X_O = C \mid X_W = w) = \Pr(X_O = C \mid X_W \in U_i)$ for all $w \in U_i$ such that $\Pr(X_W = w) > 0$.
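
For reference, the right-hand side of (a) is Jeffrey conditioning on a partition, which gives each $w \in U_i$ the probability $\alpha_i \Pr_W(w \mid U_i)$ (the identity used in the proof below). A minimal sketch follows; the function name `jeffrey` and the numbers are hypothetical, and every cell is assumed to have positive prior probability.

```python
from fractions import Fraction as F

def jeffrey(p_w, partition, alphas):
    """Jeffrey conditioning on a partition: each world w in cell U_i gets
    alphas[i] * P_W(w | U_i). Assumes every cell has positive prior mass."""
    post = {}
    for cell, alpha in zip(partition, alphas):
        mass = sum(p_w[w] for w in cell)
        for w in cell:
            post[w] = alpha * p_w[w] / mass
    return post

# Hypothetical prior and observation (1/3) U_1; (2/3) U_2.
p_w = {0: F(1, 4), 1: F(1, 4), 2: F(1, 2)}
partition = [{0, 1}, {2}]
posterior = jeffrey(p_w, partition, [F(1, 3), F(2, 3)])
```

The result assigns $1/6$ to each world of $U_1$ and $2/3$ to the single world of $U_2$: the cell probabilities are reset to the $\alpha_i$, while relative probabilities inside each cell are unchanged.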

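Proof: The proof is similar in spirit to that of Theorem 3.1. Suppose that (a) holds, $w \in U_i$, and $\Pr(X_W = w) > 0$. Then

$$
\begin{aligned}
\Pr(X_O = C \mid X_W = w) &= \Pr(X_W = w \mid X_O = C)\,\Pr(X_O = C)/\Pr(X_W = w)\\
&= \Pr_W(w \mid \alpha_1 U_1; \ldots; \alpha_n U_n)\,\Pr(X_O = C)/\Pr(X_W = w)\\
&= \alpha_i \Pr_W(w \mid U_i)\,\Pr(X_O = C)/\Pr_W(w)\\
&= \alpha_i \Pr(X_O = C)/\Pr_W(U_i).
\end{aligned}
$$

Similarly,

$$
\begin{aligned}
\Pr(X_O = C \mid X_W \in U_i) &= \Pr(X_W \in U_i \mid X_O = C)\,\Pr(X_O = C)/\Pr(X_W \in U_i)\\
&= \sum_{w' \in U_i} \Pr_W(w' \mid \alpha_1 U_1; \ldots; \alpha_n U_n)\,\Pr(X_O = C)/\Pr(X_W \in U_i)\\
&= \sum_{w' \in U_i} \alpha_i \Pr_W(w' \mid U_i)\,\Pr(X_O = C)/\Pr(X_W \in U_i)\\
&= \alpha_i \Pr(X_O = C)/\Pr_W(U_i).
\end{aligned}
$$

Thus, $\Pr(X_O = C \mid X_W = w) = \Pr(X_O = C \mid X_W \in U_i)$ for all $w \in U_i$ such that $\Pr(X_W = w) > 0$.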

For the converse, suppose that (b) holds and $\Pr(X_O = C) > 0$. Given $w \in U_i$, if $\Pr(X_W = w) = 0$, then (a) trivially holds, so suppose that $\Pr(X_W = w) > 0$. Clearly $\Pr_W(w \mid \alpha_1 U_1; \ldots; \alpha_n U_n) = \alpha_i \Pr_W(w \mid U_i)$. Now, using (b), we have that

$$
\begin{aligned}
\Pr(X_W = w \mid X_O = C) &= \Pr(X_O = C \mid X_W = w)\,\Pr(X_W = w)/\Pr(X_O = C)\\
&= \Pr(X_O = C \mid X_W \in U_i)\,\Pr(X_W = w)/\Pr(X_O = C)\\
&= \Pr(X_W \in U_i \mid X_O = C)\,\Pr(X_W = w)/\Pr(X_W \in U_i)\\
&= \alpha_i \Pr_W(w \mid U_i) \quad \text{[using (5)]}.
\end{aligned}
$$

Thus, (a) holds. ∎

Proposition 5.2: Consider a partition $\{U_1, \ldots, U_n\}$ of $W$ and a set of $k > 1$ observations $\mathcal{O} = \{C_1, \ldots, C_k\}$ with $C_i = \alpha_{i1} U_1; \ldots; \alpha_{in} U_n$ such that all $\alpha_{ij} > 0$. For every distribution $P_O$ on $\mathcal{O}$ with $P_O(C_i) > 0$ for all $i \in \{1, \ldots, k\}$, there exists a distribution $\Pr$ on $\mathcal{R}$ such that $P_O = \Pr_O$ (i.e., $P_O$ is the marginal of $\Pr$ on $\mathcal{O}$) and $\Pr$ satisfies the generalized CAR condition (part (b) of Theorem 5.1) for $U_1, \ldots, U_n$.

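Proof: Given a set $W$ of worlds, a set $\mathcal{O} = \{C_1, \ldots, C_k\}$ of observations with distribution $P_O$ satisfying $P_O(C_i) > 0$ for $i \in \{1, \ldots, k\}$, and arbitrary distributions $\Pr_j$ on $U_j$, $j = 1, \ldots, n$, we explicitly construct a prior $\Pr$ on $\mathcal{R}$ that satisfies CAR such that $P_O = \Pr_O$, where $\Pr_O$ is the marginal of $\Pr$ on $\mathcal{O}$ and $\Pr_j = \Pr_W(\cdot \mid U_j)$.

Given $w \in U_j$, define

$$\Pr(\{r \in \mathcal{R} : X_O(r) = C_i,\ X_W(r) = w\}) = P_O(C_i)\,\alpha_{ij}\,\Pr_j(w).$$

(How the probability is split up over all the runs $r$ such that $X_O(r) = C_i$ and $X_W(r) = w$ is irrelevant.) It remains to check that $\Pr$ is a distribution on $\mathcal{R}$ and that it satisfies all the requirements. It is easy to check that

$$\Pr(X_O = C_i) = \sum_{j=1}^{n} \sum_{w \in U_j} P_O(C_i)\,\alpha_{ij}\,\Pr_j(w) = P_O(C_i).$$

It follows that $\sum_{i=1}^{k} \Pr(X_O = C_i) = 1$, showing that $\Pr$ is a probability measure and $P_O$ is the marginal of $\Pr$ on $\mathcal{O}$. If $w \in U_j$, then

$$
\begin{aligned}
\Pr_W(w \mid U_j) &= \Pr_W(w)/\Pr_W(U_j)\\
&= \frac{\sum_{i=1}^{k} \Pr_O(C_i)\,\alpha_{ij}\,\Pr_j(w)}{\sum_{w' \in U_j} \sum_{i=1}^{k} \Pr_O(C_i)\,\alpha_{ij}\,\Pr_j(w')}\\
&= \frac{\Pr_j(w) \sum_{i=1}^{k} \Pr_O(C_i)\,\alpha_{ij}}{\sum_{i=1}^{k} \Pr_O(C_i)\,\alpha_{ij}}\\
&= \Pr_j(w).
\end{aligned}
$$

Finally, note that, for $j \in \{1, \ldots, n\}$, for all $w \in U_j$ such that $\Pr(X_W = w) > 0$, we have that

$$
\begin{aligned}
\Pr(X_O = C_i \mid X_W = w) &= \frac{\Pr_O(C_i)\,\alpha_{ij}\,\Pr_j(w)}{\sum_{i'=1}^{k} \Pr_O(C_{i'})\,\alpha_{i'j}\,\Pr_j(w)}\\
&= \frac{\Pr_O(C_i)\,\alpha_{ij}}{\sum_{i'=1}^{k} \Pr_O(C_{i'})\,\alpha_{i'j}}\\
&= \frac{\Pr_O(C_i)\,\alpha_{ij}\,\Pr(X_W \in U_j)}{\sum_{i'=1}^{k} \Pr_O(C_{i'})\,\alpha_{i'j}\,\Pr(X_W \in U_j)}\\
&= \frac{\Pr(X_O = C_i \cap X_W \in U_j)}{\Pr(X_W \in U_j)}\\
&= \Pr(X_O = C_i \mid X_W \in U_j),
\end{aligned}
$$

so the generalized CAR condition holds for $\{U_1, \ldots, U_n\}$. ∎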
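
The construction in this proof is easy to instantiate. The sketch below builds the prior $\Pr(X_O = C_i, X_W = w) = P_O(C_i)\,\alpha_{ij}\,\Pr_j(w)$ for a hypothetical two-cell partition with two observations, and exposes the two conditional probabilities that the generalized CAR condition compares. All numbers and the helper names are illustrative assumptions.

```python
from fractions import Fraction as F

def build_prior(p_o, alphas, cond):
    """Pr(X_O = C_i, X_W = w) = P_O(C_i) * alpha_ij * Pr_j(w) for w in U_j.
    cond[j] is the chosen distribution Pr_j on the cell U_j."""
    return {(i, w): p_o[i] * alphas[i][j] * pj[w]
            for i in range(len(p_o))
            for j, pj in enumerate(cond)
            for w in pj}

def obs_given_world(prior, i, w):
    """Pr(X_O = C_i | X_W = w)."""
    total = sum(v for (_, w2), v in prior.items() if w2 == w)
    return prior[(i, w)] / total

def obs_given_cell(prior, i, cell):
    """Pr(X_O = C_i | X_W in cell)."""
    total = sum(v for (_, w2), v in prior.items() if w2 in cell)
    return sum(prior[(i, w)] for w in cell) / total

# Hypothetical instance: U_1 = {0, 1}, U_2 = {2}, and k = 2 observations.
cells = [{0, 1}, {2}]
cond = [{0: F(1, 3), 1: F(2, 3)}, {2: F(1)}]
alphas = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]   # alphas[i][j] = alpha_ij
p_o = [F(2, 5), F(3, 5)]

prior = build_prior(p_o, alphas, cond)
```

In this instance $\Pr(X_O = C_1 \mid X_W = w) = 4/7$ for both worlds of $U_1$, which is also $\Pr(X_O = C_1 \mid X_W \in U_1)$, and the marginal of `prior` on the observations is `p_o`.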

To prove Theorem 5.3, we first need some background on minimum relative entropy distributions. Fix some space $W$ and let $U_1, \ldots, U_n$ be subsets of $W$. Let $A$ be the set of $(\alpha_1, \ldots, \alpha_n)$ for which there exists some distribution $P_W$ with $P_W(U_i) = \alpha_i$ for $i = 1, \ldots, n$ and $P_W(w) > 0$ for all $w \in W$. Now let $P_W$ be a distribution with $P_W(w) > 0$ for all $w \in W$. Given a vector $\vec{\beta} = (\beta_1, \ldots, \beta_n) \in \mathbb{R}^n$, let

$$P_W^{\vec{\beta}}(w) = \frac{1}{Z}\, e^{\beta_1 \mathbf{1}_{U_1}(w) + \cdots + \beta_n \mathbf{1}_{U_n}(w)}\, P_W(w),$$

where $\mathbf{1}_U$ is the indicator function, i.e. $\mathbf{1}_U(w) = 1$ if $w \in U$ and 0 otherwise, and $Z = \sum_{w \in W} e^{\beta_1 \mathbf{1}_{U_1}(w) + \cdots + \beta_n \mathbf{1}_{U_n}(w)}\, P_W(w)$ is a normalization factor. Let $\alpha_i = P_W^{\vec{\beta}}(U_i)$ for $i = 1, \ldots, n$. By (Csiszár, 1975, Theorems 2.1 and 3.1), it follows that

$$P_W(\cdot \mid \alpha_1 U_1; \ldots; \alpha_n U_n) = P_W^{\vec{\beta}}. \tag{11}$$

Moreover, for each vector $(\alpha_1, \ldots, \alpha_n) \in A$, there is a vector $\vec{\beta} = (\beta_1, \ldots, \beta_n) \in \mathbb{R}^n$ such that (11) holds. (For an informal and easy derivation of (11), see (Cover & Thomas, 1991, Chapter 9).)
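
To make the exponential form concrete, the following sketch evaluates $P_W^{\vec{\beta}}$ and the induced constraint values $\alpha_i = P_W^{\vec{\beta}}(U_i)$ for a given $\vec{\beta}$. It only evaluates the formula above; it does not search for the $\vec{\beta}$ matching prescribed $\alpha_i$. The four-world space, the uniform prior, and the name `tilt` are hypothetical.

```python
import math

def tilt(p_w, subsets, beta):
    """Return the distribution proportional to
    exp(sum of beta_i over the sets U_i containing w) * P_W(w)."""
    weights = {w: math.exp(sum(b for b, U in zip(beta, subsets) if w in U)) * p
               for w, p in p_w.items()}
    z = sum(weights.values())            # normalization factor Z
    return {w: v / z for w, v in weights.items()}

# Hypothetical instance: two overlapping sets and a uniform prior.
subsets = [{0, 2}, {1, 2}]
p_w = {w: 0.25 for w in range(4)}
beta = [math.log(2), 0.0]                # beta_2 = 0: the second set gets no tilt

tilted = tilt(p_w, subsets, beta)
alpha = [sum(tilted[w] for w in U) for U in subsets]
```

Here `alpha` is approximately $(2/3, 1/2)$. The second value equals the prior probability of $U_2$ in this particular example, where $\beta_2 = 0$ and $U_2$ has the same conditional probability inside and outside $U_1$.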

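Lemma A.2: Let $C = \alpha_1 U_1; \ldots; \alpha_n U_n$ for some $(\alpha_1, \ldots, \alpha_n) \in A$. Let $(\beta_1, \ldots, \beta_n)$ be a vector such that (11) holds for $\alpha_1, \ldots, \alpha_n$. If $\beta_i = 0$ for some $i \in \{1, \ldots, n\}$, then

$$P_W(U_i \mid \alpha_1 U_1; \ldots; \alpha_{i-1} U_{i-1}; \alpha_{i+1} U_{i+1}; \ldots; \alpha_n U_n) = \alpha_i.$$

Proof: Without loss of generality, assume that $\beta_1 = 0$. Taking $\alpha'_i = P_W^{\vec{\beta}}(U_i)$ for $i = 2, \ldots, n$, it follows from (11) that

$$P_W(w \mid \alpha'_2 U_2; \ldots; \alpha'_n U_n) = \frac{1}{Z}\, e^{\beta_2 \mathbf{1}_{U_2}(w) + \cdots + \beta_n \mathbf{1}_{U_n}(w)}\, P_W(w),$$

so that

$$P_W(\cdot \mid \alpha_1 U_1; \ldots; \alpha_n U_n) = P_W(\cdot \mid \alpha'_2 U_2; \ldots; \alpha'_n U_n).$$

Since $P_W(U_i \mid \alpha'_2 U_2; \ldots; \alpha'_n U_n) = \alpha'_i$ and $P_W(U_i \mid \alpha_1 U_1; \ldots; \alpha_n U_n) = \alpha_i$ for $i = 2, \ldots, n$, we have that $\alpha_i = \alpha'_i$ for $i = 2, \ldots, n$. Thus, $P_W(\cdot \mid \alpha_2 U_2; \ldots; \alpha_n U_n) = P_W(\cdot \mid \alpha_1 U_1; \ldots; \alpha_n U_n)$ and, in particular,

$$\alpha_1 = P_W(U_1 \mid \alpha_1 U_1; \ldots; \alpha_n U_n) = P_W(U_1 \mid \alpha_2 U_2; \ldots; \alpha_n U_n).$$

∎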

Theorem 5.3: Given a set $\mathcal{R}$ of runs and a set $\mathcal{O} = \{C_1, C_2\}$ of observations, where $C_i = \alpha_{i1} U_1; \alpha_{i2} U_2$, for $i = 1, 2$, let $\Pr$ be a distribution on $\mathcal{R}$ such that $\Pr(X_O = C_1), \Pr(X_O = C_2) > 0$, and $\Pr_W(w) = \Pr(X_W = w) > 0$ for all $w \in W$. Let $\Pr^i = \Pr(\cdot \mid X_O = C_i)$, and let $\Pr^i_W$ be the marginal of $\Pr^i$ on $W$. If either $C_1$ or $C_2$ is not Jeffrey-like, then we cannot have $\Pr^i_W = \Pr_W(\cdot \mid C_i)$, for both $i = 1, 2$.

Proof: Let $V_1 = U_1 - U_2$, $V_2 = U_2 - U_1$, $V_3 = U_1 \cap U_2$, and $V_4 = W - (U_1 \cup U_2)$. Since $V_1, V_2, V_3, V_4$ are all assumed to be nonempty, we have $A = (0, 1)^2$, where $A$ is defined as above, that is, $A$ is the set of $(\alpha_1, \alpha_2)$ such that there exists a distribution $P_W$ with $P_W(U_1) = \alpha_1$, $P_W(U_2) = \alpha_2$, and $P_W(w) > 0$ for all $w \in W$. If $\Pr^i_W = \Pr_W(\cdot \mid C_i)$ for $i = 1, 2$, then

$$\lambda \Pr_W(\cdot \mid C_1) + (1 - \lambda) \Pr_W(\cdot \mid C_2) = \Pr_W, \tag{12}$$

where $\lambda = \Pr(X_O = C_1)$. We prove the theorem by showing that (12) cannot hold if either $C_1$ or $C_2$ is not Jeffrey-like. Since we have assumed that $(\alpha_{i1}, \alpha_{i2}) \in (0, 1)^2 = A$ for $i = 1, 2$, we can apply (11) to $C_i$ for $i = 1, 2$. Thus, there are vectors $(\beta_{i1}, \beta_{i2}) \in \mathbb{R}^2$ for $i = 1, 2$ such that, for all $w \in W$,

$$\Pr_W(w \mid C_i) = \frac{1}{Z_i}\, e^{\beta_{i1} \mathbf{1}_{U_1}(w) + \beta_{i2} \mathbf{1}_{U_2}(w)}\, \Pr_W(w). \tag{13}$$

(13) implies $\Pr_W(V_1 \mid C_i) = Z_i^{-1} e^{\beta_{i1}} \Pr_W(V_1)$, $\Pr_W(V_2 \mid C_i) = Z_i^{-1} e^{\beta_{i2}} \Pr_W(V_2)$, $\Pr_W(V_3 \mid C_i) = Z_i^{-1} e^{\beta_{i1} + \beta_{i2}} \Pr_W(V_3)$, and $\Pr_W(V_4 \mid C_i) = Z_i^{-1} \Pr_W(V_4)$. Plugging this into (12), we obtain the following four equations:

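$$
\begin{aligned}
\Pr_W(V_1) &= \lambda\, \frac{e^{\beta_{11}}}{Z_1}\, \Pr_W(V_1) + (1 - \lambda)\, \frac{e^{\beta_{21}}}{Z_2}\, \Pr_W(V_1)\\
\Pr_W(V_2) &= \lambda\, \frac{e^{\beta_{12}}}{Z_1}\, \Pr_W(V_2) + (1 - \lambda)\, \frac{e^{\beta_{22}}}{Z_2}\, \Pr_W(V_2)\\
\Pr_W(V_3) &= \lambda\, \frac{e^{\beta_{11} + \beta_{12}}}{Z_1}\, \Pr_W(V_3) + (1 - \lambda)\, \frac{e^{\beta_{21} + \beta_{22}}}{Z_2}\, \Pr_W(V_3)\\
\Pr_W(V_4) &= \lambda\, \frac{1}{Z_1}\, \Pr_W(V_4) + (1 - \lambda)\, \frac{1}{Z_2}\, \Pr_W(V_4).
\end{aligned}
\tag{14}
$$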


Since we have assumed that $\Pr(w) > 0$ for all $w \in W$, it must be the case that $\Pr_W(V_i) > 0$, for $i = 1, \ldots, 4$. Thus, $\Pr_W(V_i)$ factors out of the $i$th equation above. By the change of variables $\mu = \lambda/Z_1$, $1 - \mu = (1 - \lambda)/Z_2$, $\epsilon_{ij} = e^{\beta_{ij}} - 1$ and some rewriting, we see that (14) is equivalent to

$$
\begin{aligned}
0 &= \mu\,\epsilon_{11} + (1 - \mu)\,\epsilon_{21}\\
0 &= \mu\,\epsilon_{12} + (1 - \mu)\,\epsilon_{22}\\
0 &= \mu\,(\epsilon_{11} + \epsilon_{12} + \epsilon_{11}\epsilon_{12}) + (1 - \mu)\,(\epsilon_{21} + \epsilon_{22} + \epsilon_{21}\epsilon_{22}).
\end{aligned}
\tag{15}
$$

If, for some $i$, both $\epsilon_{i1}$ and $\epsilon_{i2}$ are nonzero, then the three equations of (15) have no solutions for $\mu \in (0, 1)$. Equivalently, if for some $i$, both $\beta_{i1}$ and $\beta_{i2}$ are nonzero, then the four equations of (14) have no solutions for $\lambda \in (0, 1)$. So it only remains to show that for some $i$, both $\beta_{i1}$ and $\beta_{i2}$ are nonzero. To see this, note that by assumption for some $i$, $C_i$ is not Jeffrey-like. But then it follows from Lemma A.2 above that both $\beta_{i1}$ and $\beta_{i2}$ are nonzero. Thus, the theorem is proved. ∎
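
The claim that (15) has no solution with $\mu \in (0,1)$ when some pair $\epsilon_{i1}, \epsilon_{i2}$ is nonzero is stated without computation. One way to verify it, using only the three equations of (15), is sketched below; this is an added derivation rather than part of the original argument.

```latex
\begin{align*}
&\text{Subtracting the first two equations of (15) from the third gives}\\
&\qquad 0 = \mu\,\epsilon_{11}\epsilon_{12} + (1-\mu)\,\epsilon_{21}\epsilon_{22}.\\
&\text{For } \mu \in (0,1), \text{ the first two equations give }
  \epsilon_{21} = -\tfrac{\mu}{1-\mu}\,\epsilon_{11}, \quad
  \epsilon_{22} = -\tfrac{\mu}{1-\mu}\,\epsilon_{12},\\
&\text{so } (1-\mu)\,\epsilon_{21}\epsilon_{22}
  = \tfrac{\mu^{2}}{1-\mu}\,\epsilon_{11}\epsilon_{12}, \text{ and hence}\\
&\qquad 0 = \mu\,\epsilon_{11}\epsilon_{12}\Bigl(1 + \tfrac{\mu}{1-\mu}\Bigr)
        = \tfrac{\mu}{1-\mu}\,\epsilon_{11}\epsilon_{12}.\\
&\text{Therefore } \epsilon_{11}\epsilon_{12} = 0, \text{ and then }
  \epsilon_{21}\epsilon_{22}
  = \bigl(\tfrac{\mu}{1-\mu}\bigr)^{2}\epsilon_{11}\epsilon_{12} = 0.
\end{align*}
```

So any solution with $\mu \in (0,1)$ forces at least one of $\epsilon_{i1}, \epsilon_{i2}$ to vanish for each $i$, which is the contrapositive of the claim.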